Introduction


Introductions 


Expectations 


This class must be practical. 

We are *very* interested in constructive criticism and 
suggestions. As much as possible, put your comments in 
writing on the evaluation that was page 0. 


This class contains sensitive information - please be *very* 
careful who you share it with. 


Please work in pairs, and work on the same machine all week long. 
- if you trash your disk, you need to fix it 


- we will be doing detailed work, which goes 
faster with two people 


Expect to work hard. This is not a class for the faint-of-heart 
or people that want to be spoon-fed. 


Realize that your instructor doesn’t know everything - if he 
doesn’t say, "I don’t know" from time to time, get suspicious :-) 


HP-UX source will not be a part of the class. 


Feel free to ask questions, but please defer them until lab 
time if appropriate. 


"I hear and I forget. I see and I remember. I do and 
I understand." 


Overview of the Class 


Background of HP-UX 


The "Big Picture" of the Kernel 
HP-UX Origins and Compatibility


Original UNIX(tm) came from Bell Labs in the late 1960s. 


Over time it was refined, and AT&T released version 7. It 
has been said that V7 was better than either its predecessors 
or its successors. 


AT&T released System III and System V, and System V has become 
a standard that many people accept. 


UC Berkeley took V7 or something similar and started going 
another way. They have since released 4.1-4.3, and BSD 
is a standard that another set of people accept. 


HP-UX on both the S300 and S800 is a port of BSD4.2, with a
System V call interface on top of it. It passes the SVID
(for V.2 as of May 1988), but has many of the smart things that
Berkeley did (demand-paged VM, HFS filesystem, etc).


In 8.0, there is a totally different VM system, based largely 
on System V.3. 


HP-UX Structure Overview 
The "Big Picture" of the Kernel


- What is it there for?
    manage resources
    make life easier for the programmer


- What are the major components? 


kernel processes: swapper, pager, [init], [CSPs]... 
device drivers 
privileged library routines that deal with: 

- processes 

- memory 

- the file system 

- the I/O system 

- diskless nodes
Timeout routines - not really a process, but they act
like it in the sense that they are responsible for
monitoring free memory, CPU scheduling, etc. If they
were part of a process, it would be process 0, but they
operate independently of it.
(These can be thought of as "internal at(1) jobs".
Inside the kernel, one can call a routine named timeout()
and tell it to call a particular function N clock ticks
from now.)
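
The timeout() idea above can be pictured as a small table of pending requests that gets aged on every clock tick. What follows is a user-space sketch only: `toy_timeout`, `toy_clock_tick`, `toy_bump` and the fixed-size table are hypothetical names made up for illustration, and the real kernel callout structures are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NCALLOUT 8              /* hypothetical table size */

struct toy_callout {
    int   ticks_left;               /* ticks until the function fires */
    void (*func)(void *);           /* routine to call */
    void *arg;                      /* argument handed to func */
    int   in_use;
};

static struct toy_callout toy_table[TOY_NCALLOUT];

/* Ask for func(arg) to be called n clock ticks from now.
 * Returns 0 on success, -1 if the table is full. */
int toy_timeout(void (*func)(void *), void *arg, int n)
{
    assert(func != NULL);
    for (size_t i = 0; i < TOY_NCALLOUT; i++) {
        if (!toy_table[i].in_use) {
            toy_table[i].ticks_left = n;
            toy_table[i].func = func;
            toy_table[i].arg = arg;
            toy_table[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Called once per clock tick: age every entry and run the expired ones. */
void toy_clock_tick(void)
{
    for (size_t i = 0; i < TOY_NCALLOUT; i++) {
        if (toy_table[i].in_use && --toy_table[i].ticks_left <= 0) {
            toy_table[i].in_use = 0;        /* one-shot, like an at(1) job */
            toy_table[i].func(toy_table[i].arg);
        }
    }
}

/* Example callback: count how many times it has been invoked. */
void toy_bump(void *arg)
{
    (*(int *)arg)++;
}
```

Calling `toy_timeout(toy_bump, &counter, 3)` and then `toy_clock_tick()` three times bumps the counter exactly once, on the third tick.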


SE 390: Series 300 HP-UX Internals 


Introduction 
The Kernel in One Page :-) 


- PROCESSES are running programs; they have their own private 
address space, they (hopefully) get to use the CPU from time 
to time, and the kernel keeps information about them in 
structures called the "u area" and "proc table entry". 


- The I/O system is largely composed of device drivers, each 
of which specializes in a particular kind of device interface.
There are also general principles of how interrupt-driven 
devices talk to the system and how we decide which driver 
should be called for a given task. 


- The FILESYSTEM is responsible for organizing non-volatile 
data on the disk. HP-UX uses the Berkeley filesystem, 
which can be thought of as many small Bell filesystems 
stuck together on the disk. The filesystem also has 
provisions in-core to handle other kinds of filesystems, 
such as NFS or CDFS. It does this through an abstraction 
called a "vnode". 


- MEMORY is managed by the kernel in such a way that each 
process gets some private address space, and the sum of 
the amounts of memory used by each process can be much 
greater than the amount of RAM in the machine. 


[Figure: block diagram of the kernel subsystems sitting above the HARDWARE layer; surviving labels: VM, LAN, Diskless]

Access to the Kernel
- System calls.
    - front ends in libc
    - 68K
        - move system call number into d0 (680x0 register)
        - change modes with "trap 0", which kernel catches
        - trap handler calls syscall()
    - PA
        - each process has something called a "gateway"
          page mapped into its address space
        - in this page there is a "gate" instruction,
          which "promotes" the privilege level of the
          process and calls the kernel routine syscall()
    - actual system call code is called indirectly, using
      the system call number as an index into sysent[]
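
The last step (indexing a table by system call number) can be sketched in a few lines of ordinary C. This is an illustration under assumptions, not kernel source: the `toy_` names, the two fake handlers, and the table layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One slot per system call: argument count plus handler (hypothetical layout). */
struct toy_sysent {
    int  sy_narg;
    long (*sy_call)(long *args);
};

static long toy_getpid(long *args) { (void)args; return 42; }
static long toy_add(long *args)    { return args[0] + args[1]; }

static const struct toy_sysent toy_sysent_tab[] = {
    { 0, toy_getpid },   /* call number 0 */
    { 2, toy_add },      /* call number 1 */
};

#define TOY_NSYSENT (sizeof toy_sysent_tab / sizeof toy_sysent_tab[0])

/* Roughly what the trap handler does: validate the number that the
 * libc stub left in a register, then call through the table. */
long toy_syscall(unsigned int code, long *args)
{
    if (code >= TOY_NSYSENT)
        return -1;                       /* unknown system call */
    return toy_sysent_tab[code].sy_call(args);
}
```

The point of the indirection is that the user-mode stub never needs a kernel address; it only needs a small integer.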


- The assembly-level debugger, adb(1). 

- The kernel debugger, SYSDEBUG (68K) or DDB (PA). This is most 
useful for people in the lab’s kernel group or people writing 
drivers - not very useful without source (and DDB requires 
a 300 or 400 to run on - it is not a standalone debugger) 

- Calls to nlist(3) & /dev/kmem 

- YOU ARE ON YOUR OWN 


- call nlist(3) to get address of symbol from "a.out" 
file (/hp-ux in this case) 


- open /dev/kmem and seek to address 
- read information 


- YOU ARE ON YOUR OWN - KERNEL DATA STRUCTURES 
CHANGE FROM RELEASE TO RELEASE! 
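
The "open, seek, read" half of the recipe above might look like the sketch below. The nlist(3) step is deliberately left out; `read_at` is a hypothetical helper name, and in real use the path would be /dev/kmem (which needs the right privileges) and the offset would be the symbol address that nlist(3) returned for /hp-ux.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Seek to `addr` in the file at `path` and read `len` bytes into `buf`.
 * Returns 0 on success, -1 on any failure (open, seek, or short read). */
int read_at(const char *path, off_t addr, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    int ok = lseek(fd, addr, SEEK_SET) == addr &&
             read(fd, buf, len) == (ssize_t)len;

    close(fd);
    return ok ? 0 : -1;
}
```

Whatever comes back is only as meaningful as your knowledge of the structure layout for that exact release, which is the reason for the warning above.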
/* @(#) $Revision: 1.8.62.17 $ */
#ifndef _MSYS_SPACE_INCLUDED /* allows multiple inclusion */
#define _MSYS_SPACE_INCLUDED


#include "../ufs/fsdir.h" 


#include "../h/user.h" 
#include "../h/proc.h" 
#include "../h/sem_beta.h" 


#include "../h/vnode.h" 
#include "../ufs/inode.h" 


#include "../cdfs/cdfsdir.h" 

#include "../cdfs/cdnode.h" 

#include "../cdfs/cdfs.h" 

#ifdef SIXR 

#include "../machine/sna_space.h" /* for SNAP */ 
#endif


#include "../h/callout.h" 
#include "../h/kernel.h" 
#include "../h/map.h" 
#include "../h/buf.h" 
#include "../h/pty.h" 
#include "../h/nvs.h" 


#include "../machine/iobuf.h"
#include "../machine/dilio.h"
#include "../dux/rmswap.h" 


#include "../dux/dm.h" 
#include "../dux/protocol.h" 
#include "../dux/nsp.h" 


#include "../machine/lnatypes.h" 


#include "../machine/intrpt.h" 
#include "../machine/hpibio.h" 
#include "../machine/drvhw.h" 


#include "../h/devices.h" 
#include "../h/dnic.h" 
#include "../h/file.h" 


/*
 * System parameter formulae.
 */

struct timezone tz = { TIMEZONE, DST };

short rootlink[3] = { 0xffff, 0xffff, 0xffff };
char *bootlink = 0; 
int lanselectcode = -1; 


int num_cnodes = NUM_CNODES; 


/*
** Size the using/serving arrays. USING_ARRAY_SIZE and SERVING_ARRAY_SIZE
** are configurable parameters.
*/
int using_array_size = USING_ARRAY_SIZE;
struct using_entry using_array[USING_ARRAY_SIZE];

int serving_array_size = (SERVING_ARRAY_SIZE > MAX_SERVING_ARRAY) ? MAX_SERV
struct serving_entry serving_array[(SERVING_ARRAY_SIZE > MAX_SERVING_ARRAY)

int dskless_fsbufs = (DSKLESS_FSBUFS > MAX_SERVING_ARRAY) ? MAX_SERVING_ARRA


/*
** Define timeout periods for selftest and crash detection. SELFTEST_PERIOD,
** SEND_ALIVE_PERIOD and CHECK_ALIVE_PERIOD are configurable parameters.
*/


/* If selftest period is 0 then no selftest, otherwise lowerbound of 90 secs
int selftest_period = ((SELFTEST_PERIOD == 0) ? SELFTEST_PERIOD : ((SELFTEST
int check_alive_period = CHECK_ALIVE_PERIOD;
int retry_alive_period = RETRY_ALIVE_PERIOD;

int ngcsp = NGCSP;
int ncsp = NGCSP + 1; /* always one for limited CSP */
struct nsp nsp[NGCSP+1]; /* always one for limited CSP */
struct nsp *nspNCSP = &nsp[NGCSP+1];


/* semaphore to prevent regular LAN init to reinitialize the network. */ 
/* USEFUL ??? */ 


int DUX_init = 1; 
/* dskless subsystem initialization flag */ 
int dskless_initialized = 0;


#ifdef UIPC /* UIPC is the umbrella subsystem for networking */ 
/*
 * Networking
 */

#include "../h/mbuf.h"
#define PRUREQUESTS
#include "../h/protosw.h"
#include "../h/socket.h"

#ifdef INET
#include "../net/if.h"
#include ee haa eae
#include "../net/raw_cb.h"
#include "../netinet/in.h"
#include "../netinet/if_ether.h"
#include "../h/mib.h"
#include "../netinet/mib_kern.h"
#include "../net/if_ni.h"

/* ni */
int ni_max = NNI;
struct ni_cb ni_cb[NNI];

/*
 * Internet Domain
 */
#define TCPSTATES
#include "../netinet/tcp_fsm.h"
struct ifqueue ipintrq;


/*
 * (X)NS Domain
 */
struct ifqueue nsintrq;

#endif /* INET */
#endif /* UIPC */

/*
 * Netisr
 */
int netisr_priority = NETISR_PRIORITY;
int netmemmax = NETMEMMAX;


#ifdef NSDIAG
#include "../sio/nsdiag0.h"
#define NSDIAG_MAX_QUEUE 500
int nsdiag0_high_water = NSDIAG_MAX_QUEUE;
nsdiag_event_msg_type *nsdiag0_msg_queue; /* msg queue */
#endif /* NSDIAG */


#ifdef LAN01
#include "../sio/lanc.h"
#include "../machine/drvhw_ift.h"

#if ((NUM_LAN_CARDS > 0) && (MAX_LAN_CARDS > NUM_LAN_CARDS))
int num_lan_cards = NUM_LAN_CARDS;
#else
#if (NUM_LAN_CARDS > MAX_LAN_CARDS)
int num_lan_cards = MAX_LAN_CARDS; /* exceed MAX_LAN_CARDS */
#else /* We force it to default */
int num_lan_cards = 2;
#endif /* NUM_LAN_CARDS > MAX_LAN_CARDS */
#endif

lan_ift * lan_dio at ta
#endif /* LAN01 */


/*
 * Streams subsystem
 */
#ifdef HPSTREAMS
int strmsgsz = STRMSGSZ;
int strctlsz = STRCTLSZ;
int nstrevent = NSTREVENT;
int nstrpush = NSTRPUSH;


#include "../streams/str_hpux.h" 
#include "../streams/str_stream.h" 


#endif /* HPSTREAMS */ 


#define NETSLOP 20 


#ifdef NOSWAP 
#define NOSWAP 1 


#else 

#define NOSWAP 0 

#endif 

#define NCLIST (100+16*MAXUSERS) 
int nclist = NCLIST; 

int nproc = NPROC; 

int ninode = NINODE; 

/*
 * maxfiles is the system default soft limit for the maximum number of
 * open files per process. maxfiles defaults to 60 if not configured.
 * maxfiles_lim is the system default hard limit for the maximum number of
 * open files per process. maxfiles_lim defaults to 1024 if not configured.
 */
int maxfiles = MAXFILES;
int maxfiles_lim = MAXFILES_LIM;


/*The NCDNODE should be defined in master for configurability. Before we 
can actually do it, this is what we can do now.*/ 
#define NCDNODE 150 


int ncdnode = NCDNODE; 
int ncallout = NCALLOUT; 
long unlockable_mem = UNLOCKABLE_MEM;
int nfile = NFILE + FILE_PAD;
int file_pad = FILE_PAD;
int nbuf = NBUF;
int nflocks = NFLOCKS;
int npty = NPTY;
int ndilbuffers = NDILBUFFERS;
int ncsize = NINODE;
struct ncache ncache[NINODE];
/*
 * Hash table of open devices.
 */
dtaddr_t devhash[DEVHSZ];


int maxuprc = MAXUPRC; 

int maxdsiz = MAXDSIZ/NBPG; /* unit: page size */ 
int maxssiz = MAXSSIZ/NBPG; /* unit: page size */ 
int maxtsiz = MAXTSIZ/NBPG; /* unit: page size */ 
int parity_option = PARITY_OPTION;
int reboot_option = REBOOT_OPTION;
int noswap = NOSWAP;
int install = NOSWAP;
int timeslice = TIMESLICE; /* unit: 20ms tick */
int acctsuspend = ACCTSUSPEND; /* unit: percent of filesystem free
int acctresume = ACCTRESUME; /* unit: percent of filesystem free
int dos_mem_byte = DOS_MEM_BYTE; /* mem. reserved for dos in bytes
int mem_no = 3; /* major device number of memory special file */
int ieee802_no = 18;
int ethernet_no = 19;
uint dos_mem_start; /* physical addr. of dos mem. */
int scroll_lines = SCROLL_LINES; /* number of lines of ITE buffer */
/*
 * The tty stuff that needs to be declared somewhere.
 */
#define NPCI 16

short npci = NPCI;
struct tty *tty_line[NPCI];
struct tty *cons_tty;


/*
 * These have to be allocated somewhere; allocating
 * them here forces loader errors if this file is omitted.
 */
struct proc *proc, *procNPROC, *cur_proc;
struct inode *inode, *inodeNINODE;
struct callout *callout;
struct file *file, *fileNFILE, *file_reserve;
struct locklist locklist[NFLOCKS]; /* The lock table itself */
struct tty pt_tty[NPTY];
struct tty *pt_line[NPTY];
struct pty_info pty_info[NPTY];
struct nvsj nvsj[NPTY];
struct buf dil_bufs[NDILBUFFERS];
struct iobuf dil_iobufs[NDILBUFFERS];
struct dil_info dil_info[NDILBUFFERS];
int (*fhs_timeout_proc)() = NULL;

/* declarations for stub routines for non-configurable portions of EISA bus
extern nop();

int (*eisa_init_routine)() = nop;
int (*eisa_nmi_routine)() = nop;
int (*eisa_eoi_routine)() = nop;

/* declarations for stub routines for non-configurable portions of MTV (VME)
int (*vme_init_routine)() = nop;


/*
** The following supports savecore on the s300
*/
long dumplo; /* offset into dumpdev */
int dumpsize; /* amount of NBPG phys mem to save - dep on swap */
int dumpmag; /* magic number for savecore, 0x8fca0101 */


/* dumpdev is now generated into conf.c by config */ 


struct cblock *cfree;
struct buf *buf, *swbuf;
short *swsize;
int *swpf;
char *buffers;
struct bufqhead bfreelist[BQUEUES]; /* heads of available lists */
struct buf bswlist; /* head of free swap header list */
char runin; /* scheduling flag */
char runout; /* scheduling flag */
int runrun; /* scheduling flag */
#ifdef RTPRIO
u_char curpri; /* more scheduling */
#else /* RTPRIO */
char curpri; /* more scheduling */
#endif /* RTPRIO */
int maxmem; /* actual max memory per process */
int physmem; /* physical memory on this CPU */
int hand; /* current index into coremap used b
int wantin;
int selwait;
/*
 * The following is for the shared memory subsystem (if configured)
 */
#if MESG==1
#include Leer 20) Ass ol ove ot.
#include "../h/msg.h"
struct ipcmap msgmap[MSGMAP];
struct msqid_ds msgque[MSGMNI];
struct msg msgh[MSGTQL];
struct msginfo msginfo = {
	MSGMAP,
	MSGMAX,
	MSGMNB,
	MSGMNI,
	MSGSSZ,
	MSGTQL,
	MSGSEG
};
int messages_present = 1;
#else
int messages_present = 0;
#endif


#if SEMA==1 


# ifndef IPC_ALLOC
# include "../h/ipc.h"
# endif
#include "../h/sem.h"


Nov 04 10:30 1992 edited 9.0 space.h Page 7 


struct semid_ds sema[SEMMNI];
struct sem sem[SEMMNS];
struct map semmap[SEMMAP];
struct sem_undo *sem_undo[NPROC];
#define SEMUSZ (sizeof(struct sem_undo)+sizeof(struct undo)*SEMUME)
int semu[((SEMUSZ*SEMMNU)+NBPW-1)/NBPW];
union {
	short semvals[SEMMSL];
	struct semid_ds ds;
	struct sembuf semops[SEMOPM];
} semtmp;
struct seminfo seminfo = {
	SEMMAP,
	SEMMNI,
	SEMMNS,
	SEMMNU,
	SEMMSL,
	SEMOPM,
	SEMUME,
	SEMUSZ,
	SEMVMX,
	SEMAEM
};
int semaphores_present = 1;
#else
int semaphores_present = 0;
#endif
#if SHMEM == 1
# ifndef IPC_ALLOC
# include "../h/ipc.h"
# endif
#include "../h/shm.h"
struct shmid_ds shmem[SHMMNI];
struct shminfo shminfo = {
	SHMMAX,
	SHMMIN,
	SHMMNI,
	SHMSEG
};
int shared_memory_present = 1;
#else
# ifndef IPC_ALLOC
# include "../h/ipc.h"
# endif
#include "../h/shm.h"
struct shmid_ds shmem[1];
int shared_memory_present = 0;
#endif
/* The parser is currently not configurable, but when it is, modify the
 * assignment of (*pn_getcomponent) () = to your choice of parser.
 * For now its pn_getcomponent_n_computer() (8bit).
 */

/* two-byte characters in file names. */
/* extern int pn_getcomponent_chinese_t(); not supported yet */
extern int pn_getcomponent_n_computer();

#ifndef PARSER
#define PARSER pn_getcomponent_n_computer
#endif

int (*pn_getcomponent)() = PARSER;


struct pidchunk {
	int start;
	int end;
} mypidchunks[NPROC];


/* The following are configuration flags for networking */ 


int relinsc_1_flag = 1;
int relinsc_2_flag = 1;
int relinsc_3_flag = 1;

int swapspc_cnt; /* pages of available swap space */
int swapmem_max; /* total pages of system available swap space */
int swapmem_cnt; /* pages of available memory for "swap" */
int maxfs_pri; /* highest available device priority */
int maxdev_pri; /* highest available swap priority */

int sys_mem; /* pages of memory not available for "swap" */ 


int minswapchunks = MINSWAPCHUNKS; 


#ifdef X25
#if (defined (NUM_PDNO) && (NUM_PDNO >= 0))
#ifndef IPPROTO_ICMP
#include "../netinet/in.h"
#endif /* NOT IPPROTO_ICMP */
#ifndef IFF_UP
#include "../net/if.h"
#endif /* IFF_UP not defined */
#include "../x25/x25gen.h"
#endif /* NUM_PDNO */
#endif /* X25 */


/*
 * Double Stuff data structures/configuration; a -1 value means that the
 * parameter will be calculated from available memory at boot time.
 */


#define VHNDFRAC -1 
#define MAXPMEM -1 


#include "../h/sysinfo.h" 
#include "../h/pfdat.h" 
#include "../h/swap.h" 


int desperate; 

struct minfo minfo; 

struct pfdat **phash; 

struct pfdat *pfdat; 

int phashmask; /* Page hash mask */ 
struct pfdat phead; 


long phread, phwrite;
int swchunk = SWCHUNK; 


int nswapfs = NSWAPFS; 
struct fswdevt fswdevt [NSWAPFS] ; 


int nswapdev = NSWAPDEV; 
struct swap_stats swap_stats [NSWAPDEV+NSWAPFS+1] ; 
int swapmem_on = SWAPMEM_ON;


int sysmem_max = SYSMEMMAX; 
int maxswapchunks = MAXSWAPCHUNKS; 


struct devpri swdev_pri[NSWPRI] ; 
struct fspri swfs_pri[NSWPRI] ; 


struct swaptab swaptab [MAXSWAPCHUNKS] ; 


vm_sema_t swap_lock; 

int nextswap; 

int swapwant;

int mpid; /* For generating unique process IDs */ 


#include "../h/var.h" 
struct var v = {
	VHNDFRAC,
};
int ticks_since_boot;

/*
 * Variables used for sar
 */

#include "../h/sar.h" 
long sar_swapin; 

long sar_swapout; 

long sar_bswapin; 

long sar_bswapout; 

struct syswait syswait; 


int procovf = 0;

int istackptr = 0; /* True if running on istack */ 
int freemem_cnt = 0; 

#ifdef GENESIS 


/* Set by graphics make_entry(), used in main() to decide whether or */
/* not to start vdmad. */

int vdma_present = 0;
#endif


/*
 * A bunch of stuff was allocated in proc.h. I’ve moved it here.
 */
short freeproc_list; /* Header of free proc table slots */
struct prochd qs[NQS];
int whichqs[NQELS]; /* Bit mask summarizing non-empty qs’s */
struct map *sysmap; /* Map of vaddr pool for system */

/*
 * HACK ATTACK
 * Dux had defined this variable in cluster.c. Including this module,
 * however leads to many more dux modules having to be compiled and linked
 * into the kernel. Rather than deal with configurability now, we simply
 * hack around the problem, knowing full well that this isn’t used for
 * anything outside of a discless environment anyway.
 */


#include "../dux/duxparam.h" 


#include "../dux/cct.h" 
struct cct clustab [MAXSITE] ; /* incore cluster configuration table */ 


/* File system async flag. If set file system data structures
   are written asynchronously. */

int fs_async = FS_ASYNC;

/*
 * flag to control creation of "fast" symbolic links.
 */
int create_fastlinks = CREATE_FASTLINKS;

/*
 * flag to turn off new AES conformance behavior for hp-ux system calls.
 */
int hpux_aes_override = AES_OVERRIDE;
/* hash table size scale with number of items hashed */ 
/* lpow2 returns largest power of 2 less than arg, min value 16, max 8192 */ 


#define lpow2(arg) \
	(arg) < 32? 16: \
	(arg) < 64? 32: \
	(arg) < 128? 64: \
	(arg) < 256? 128: \
	(arg) < 512? 256: \
	(arg) < 1024? 512: \
	(arg) < 2048? 1024: \
	(arg) < 4096? 2048: \
	(arg) < 8192? 4096: \
	8192

#define hashsize(length, items, default) \
	(lpow2 ( (items) / (length) ) )

/* proc table */


#define PIDHSZ hashsize(4, NPROC, 64) 
int PIDHMASK = PIDHSZ - 1; 
short pidhash[PIDHSZ] ; 


#define PGRPHSZ hashsize(4, NPROC, 64) 
int PGRPHMASK = PGRPHSZ - 1; 
short pgrphash [PGRPHSZ] ; 


#define UIDHSZ hashsize(4, NPROC, 64) 
int UIDHMASK = UIDHSZ - 1; 
short uidhash [UIDHSZ]; 


#define SIDHSZ hashsize(4, NPROC, 64) 
int SIDHMASK = SIDHSZ - 1; 
short sidhash [SIDHSZ]; 


/* sleep table */ 
#define SQSIZEDEF hashsize(4, NPROC, 64) 


int SQSIZE = SQSIZEDEF;
int SQMASK = SQSIZEDEF-1;


struct proc *slpque [SQSIZEDEF] ; 
struct proc *slptl1[SQSIZEDEF]; /* For FIFO sleep queues */ 


/* buffer table */ 

/* average buf hash chain length desired -- see machdep.c */ 

int bufhash_chain_length = 4; 

struct bufhd *bufhash; /* buffer hash table */ 

int BUFHSZ, BUFMASK; /* size and mask for accessing bufhash */ 
/* inode table */ 


#define INOHSZDEF hashsize(6, NINODE, 64) 


int INOHSZ = INOHSZDEF; 
int INOMASK = INOHSZDEF-1; 
union ihead { /* inode LRU cache, Chris Maltby */ 


union ihead *ih_head[2]; 
struct inode *ih_chain[2];
} ihead [INOHSZDEF] ; 
/* spinlocks */ 


#define SPINSIZEDEF (B_SEMA_HTBL_SIZE + SQSIZEDEF + 50)
int MAX_SPINLOCKS = SPINSIZEDEF;


lock_t spin_alloc_base[SPINSIZEDEF] = { 0 };
lock_t *spin_alloc_end = spin_alloc_base + SPINSIZEDEF;


int ddb_boot = DDBBOOT; 
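
To see what the lpow2() and hashsize() macros in the listing above work out to, here is a function-form sketch. The names `lpow2_fn`, `hashsize_fn` and `bucket` are hypothetical, and the NPROC value of 400 in the note below is just an example. Note that the macro returns the argument itself when it is already a power of two (32 gives 32), so "largest power of 2 not greater than arg" is the more exact description.

```c
#include <assert.h>

/* Function form of the lpow2() macro: largest power of two not exceeding
 * arg, clamped to the range 16..8192. */
unsigned int lpow2_fn(unsigned int arg)
{
    unsigned int p = 16;
    while (p < 8192 && p * 2 <= arg)
        p *= 2;
    return p;
}

/* Function form of hashsize(length, items, default): the macro body never
 * uses its third argument, so it is omitted here. */
unsigned int hashsize_fn(unsigned int length, unsigned int items)
{
    return lpow2_fn(items / length);
}

/* Because the size is a power of two, (key & (size - 1)) selects a bucket;
 * this is what the PIDHMASK-style "size - 1" variables are for. */
unsigned int bucket(unsigned int key, unsigned int size)
{
    return key & (size - 1);
}
```

With NPROC at 400 and a desired chain length of 4, `hashsize_fn(4, 400)` gives a 64-entry table and a mask of 63.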
/* @(#) $Revision: 1.11.61.2 $ */

#define MSG_MAGIC 0x063060
#define MSG_BSIZE (4096 - 2 * sizeof (long))

struct msgbuf {
	long msg_magic;
	long msg_bufx;
	char msg_bufc[MSG_BSIZE];
};

#ifdef __hp9000s800
extern struct msgbuf msgbuf;
#else
#ifdef __hp9000s300
struct msgbuf Msgbuf;
#else
struct msgbuf msgbuf;
#endif /* else not __hp9000s300 */
#endif /* else not __hp9000s800 */
Process Management


The Big Picture 


- How does HP-UX share system resources among competing processes? 


The Little Picture(s) 
- The context of a process. 
- Signal handling & job control. 
- Process creation/deletion. 
- Fork - duplicate current process. 
- Exec - replace current program with another. 
- Context switching. 


- Tunable parameters. 


The Problem 


$ ps -ef
fork failed - too many processes 


What’s going on here? 
Process Management
The Context of a Process (running program) 


- Stack, text, and data areas.
- Registers, stack pointer, program counter, etc.
- Segment and page tables.
- The u area - defined in /usr/include/sys/user.h.
    - available when process is in memory - won’t be paged out,
      but can be swapped with the process
    - has stuff like arguments to system calls, a place to save
      registers, the command that was typed, etc. These are
      things we don’t need to have available when the process is
      swapped out.
    - the kernel stack is part of the u area, but is not defined
      in user.h - it is actually in a different page and is not
      part of the "user structure".
- The proc table entry - defined in /usr/include/sys/proc.h
    - stuff that needs to always be available - priority, PID,
      signal masks, etc.
- State
    - running - we are the currently executing process.
    - runnable - we are ready to run, and are waiting for
      the processor.
        - in a run queue based on our priority
    - stopped - we were running, but were stopped by ptrace(2)
      or we received a SIGTSTP (BSD Job Control).
    - sleeping - we are waiting for a resource.
        - in a sleep queue based on temporary priority
          (interruptible if sleep will *NOT* end quickly;
          comatose if it will :-)
    - zombie - we’ve exited, but parent hasn’t done a wait(2)
      on us yet; *all* resources are freed up except the
      proc table entry (& u area in 8.0).


68K process logical address space:

    0xffffffff
    +----------------+
    | float area     |  5 pages     98635 FP card is mapped in
    |                |              here if present & in use
    +----------------+
    | u area         |  2+ pages
    +----------------+
    | big gap for    |  216 pages
    | future use     |
    +----------------+
    | dragon area    |  32 pages    98248 FP card is mapped in
    |                |              here if present & in use
    +----------------+
    | user stack     |
    |                |  <-- top of stack
    |                |
    |                |  Shared libraries (mmap(2)ed
    |                |  files) go here if in use
    |                |
    |                |  <-- top of data segment
    | user bss/data  |
    +----------------+
    | text           |
    +----------------+
    0x00000000


68K u area looks like this:

    larger addresses
    +----------------+
    | struct user    |  1 page
    +----------------+
    | kernel stack   |  1 page
    +----------------+  <-- top of 4k kernel stack
    smaller addresses


In 8.0 and later releases, kernel stacks are actually allowed to use 4 
pages (rather than just 1), but this is not often done (most kernel 
functions do *not* use much stack space). 
Process-eye View of Memory Management (68K)


- The segment table pointer is the root of all address translation. 


[Figure: two-level address translation; columns: seg. table, page tables, RAM]

The 68020 uses 2-level tables, like this diagram shows. The 68030
(in >=8.0) and 68040 use 3-level tables - a "block table" is inserted
between the segment and page tables, and the virtual address is split
into 4 parts instead of 3.


- The 680x0/MMU have stack and segment table pointers for both
user and supervisor modes. Whenever a process gets to use the
CPU, its segment table pointer and stack pointers are put into
the appropriate hardware registers. In the table below, each
item marked with Xs is changed at context-switch time.


               | segment table | stack |
    -----------+---------------+-------+
    user       | XXXXX         | XXXXX |
    -----------+---------------+-------+
    supervisor |               | XXXXX |
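
For the 2-level case, splitting a virtual address into its 3 parts can be sketched as below. The 4 KB page size matches the "4k kernel stack" page mentioned earlier; the 10-bit width of each table index is an assumption made purely for illustration, and `split_va` and `struct va_parts` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u              /* 4 KB pages */
#define INDEX_BITS 10u              /* assumed width of each table index */
#define INDEX_MASK ((1u << INDEX_BITS) - 1u)

struct va_parts {
    uint32_t seg;                   /* index into the segment table */
    uint32_t page;                  /* index into the selected page table */
    uint32_t offset;                /* byte offset within the page */
};

/* Break a 32-bit virtual address into segment index, page index, offset. */
struct va_parts split_va(uint32_t va)
{
    struct va_parts p;
    p.seg    = (va >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK;
    p.page   = (va >> PAGE_SHIFT) & INDEX_MASK;
    p.offset = va & ((1u << PAGE_SHIFT) - 1u);
    return p;
}
```

Under these assumptions, address 0x00403abc lands in segment entry 1, page entry 3, at offset 0xabc. A 3-level scheme would carve one more index field out of the upper bits.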


700 Per-process Virtual Address Space


pre-9.0


I/O SPACE
~768 MB Shared Memory
4K of gateway page(s)
Quadrant 3 (sr6)

~289 MB
Shared Library data
and
Private MMFs

~80 MB

BSS
Initialized Data
Quadrant 2 (sr5)

Quadrant 1 (sr4)


I/O SPACE
~768 MB Shared Stuff
4K of gateway page(s)
Quadrant 3 (sr6)

Reserved/redzone
kernel stack
u area

~80 MB
user stack

shared library data
and/or private MMFs
V
|
BSS
Initialized Data
Quadrant 2 (sr5)

Text (shared, unless EXEC_MAGIC
a.out; then text is "data":
text+data+priv MMFs < 1.87GB)
Quadrant 1 (sr4)
Signal Handling


- Signal sending 
- crude form of IPC 


- accomplished with kill(2), which is the heart of 
kill(1), as in 


$ kill -1 2344 


- SIGUSR[12] are available for cooperating processes 


- Signal receiving or "catching" 


- Read signal(5) for an overview of the various signal 
families. 


- can be controlled somewhat with sig*(2)
    - can specify a procedure to call when a given
      signal comes in
    - can specify an alternate signal stack
- if a non-default handler is specified, it will be called
  in such a way that it appears to be a normal procedure
  call
- SIGKILL (as in "kill -9") can NOT be caught or ignored
    - special case for init(1M) - kill(2) will refuse
      to send SIGKILL to PID 1!
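
From the user side, catching a signal looks roughly like the sketch below. It uses the POSIX sigaction(2) interface as one member of the sig*(2) family (an assumption about which call you would pick); `catch_usr1`, `try_catch_kill` and `on_usr1` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_usr1 = 0;

/* Handler: looks like a normal procedure call from the program's view. */
static void on_usr1(int signo)
{
    (void)signo;
    got_usr1 = 1;
}

/* Install on_usr1() as the handler for SIGUSR1. Returns 0 on success. */
int catch_usr1(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Try to install a handler for SIGKILL; the system is expected to refuse. */
int try_catch_kill(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGKILL, &sa, NULL);
}
```

After `catch_usr1()`, a `raise(SIGUSR1)` or a `kill -USR1` from another process sets the flag instead of terminating the program, while `try_catch_kill()` fails.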
Signal Implementation
- Signal sending 


- set a bit in the proc table entry of the receiving 
process 


- mark receiving process as runnable, *as long as it isn’t
sleeping at a priority of PZERO or less* - this is
important to remember, but shouldn’t often be an issue


- Signal receiving 
- check to see if we have signal(s) pending whenever we’re 
about to return to user mode from kernel mode and 
whenever we block in the kernel (by calling sleep()). 
- if we do, handle them or core dump or exit or whatever.... 
- if we were in the middle of a system call, we may restart
it or we may return an error - depends on what programmer
asked for.
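
The sending and receiving halves described above can be modeled with a bit mask. This is a toy model, not kernel code: `struct toy_proc`, `toy_psignal`, `toy_issig` and the value chosen for `TOY_PZERO` are all assumptions made for illustration.

```c
#include <assert.h>

#define TOY_PZERO 25                 /* hypothetical priority threshold */

struct toy_proc {
    unsigned long p_sig;             /* one pending bit per signal */
    int  p_pri;                      /* current (sleep) priority */
    int  sleeping;                   /* nonzero while blocked */
};

/* Sender side: post the signal and wake the target unless it is
 * sleeping at a priority of TOY_PZERO or less (uninterruptible). */
void toy_psignal(struct toy_proc *p, int signo)
{
    p->p_sig |= 1UL << (signo - 1);
    if (p->sleeping && p->p_pri > TOY_PZERO)
        p->sleeping = 0;             /* mark runnable */
}

/* Receiver side: return the lowest-numbered pending signal and clear it,
 * or 0 if nothing is pending. */
int toy_issig(struct toy_proc *p)
{
    for (int signo = 1; signo <= 31; signo++) {
        unsigned long bit = 1UL << (signo - 1);
        if (p->p_sig & bit) {
            p->p_sig &= ~bit;
            return signo;
        }
    }
    return 0;
}
```

Because a pending signal is a single bit, sending the same signal twice before the receiver looks is indistinguishable from sending it once.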
Process Creation/Deletion


- Created by fork(2).
    - most things are exactly duplicated
    - things like pid, ppid, etc. are different
    - stdio buffers are duplicated
    - vfork(2) is a fast version - it does NOT copy the stack
      and data - it trusts the child to do an exec
    - in 8.0, copy-on-write has made normal fork(2)
      fast as well


- Currently-running program replaced by exec(2).
    - things like file descriptors are preserved
    - things like "when this signal comes in, call this
      routine" are NOT preserved


- Deleted by exit(2) (voluntary), or most signals (involuntary).
    >> note that unless parent process does a wait(2), there <<
    >> will be a zombie sitting around...                     <<


- A process gets created whenever Cal Loh 
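
A minimal fork-and-reap sketch, to make the zombie note concrete. `spawn_and_reap` is a hypothetical helper name; it uses waitpid(2), one of the wait(2) family.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`; the parent reaps it with waitpid(2)
 * so no zombie is left behind. Returns the child's exit status, or -1. */
int spawn_and_reap(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                   /* e.g. no free proc table slot */
    if (pid == 0)
        _exit(code);                 /* child: terminate immediately */

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the parent skipped the waitpid() call, the child's proc table entry would stay allocated as a zombie until the parent itself exited.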
What Happens When Fork(2) Is Called


- The general idea is to "xerox" the calling process, changing 
only the things that must be unique (PID, resource usage, etc) 


- Specifics: 
- child will share 


- text (code - including shared libs, if used), 
shared memory; in general, any SHARED regions 


- references to open files, current/root dirs 
- child must have its own 

- proc table entry and u area 

- page tables (68K) 


- if this is a real fork and not a vfork, child 
will have its own 


- data 
- stack 
- swap area for the above


- vfork(2) is a fast, cheap alternative to fork(2) - useful when
  all we want to do is exec(2) something; the basic idea is to
  borrow the parent’s resources rather than making copies of them
  that are immediately thrown away
- in 8.0, fork(2) is implemented with copy-on-write
    - parent and child have the same physical pages mapped
    - pages are marked readonly
    - when parent *or* child modifies a page, it gets a
      private copy of that page
    - most of the time, very few pages are modified before
      the child exits or execs; this winds up being a
      significant performance win
    - vfork(2) was initially implemented this way (in 8.x),
      but this caused *serious* problems:
        - the child had to have swap space allocated
        - programs that used it as cheap shared memory
          broke
What Happens When Exec(2) Is Called


Check modes: execute bits, set[ug]id bits, etc. 


Read in first few bytes to see what kind of file it is. 


If it is non-shared, lump the data and text together as data. 


If it is a "#!" script, loop to get the real executable file. 


Be sure the file is as big as the header claims, but not too big. 


Copy arguments to a buffer. 

Be sure the file is big enough to have text, data, etc. 
Be sure text isn’t busy: ptrace(2), open for write, etc. 
Get *swap* space.

Release any locked memory. 


If we are a "vfork child", give memory back to the parent; 
otherwise, release memory. 


Get virtual memory (actually just initialize page tables to 
the appropriate thing - usually zero-fill-on-demand). 


Read data (and text if non-shared) in. 
Attach to text, reading it in if necessary. 
Set uid/gid. 


Copy arguments from buffer to new stack. 


Set registers (mostly clear them, but one is used to tell if we
have a floating point card and one is used to indicate processor
type).


Reset caught signals - there’s nothing to catch them anymore! 


Close close-on-exec files. 
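
The "read in first few bytes to see what kind of file it is" step can be imitated from user space. The sketch below only checks for the "#!" case; `is_interpreter_script` is a hypothetical name, and the a.out magic numbers that exec also recognizes are not modeled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if the file begins with "#!", 0 if it does not,
 * and -1 if it cannot be opened or is shorter than two bytes. */
int is_interpreter_script(const char *path)
{
    unsigned char magic[2];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, magic, sizeof magic);
    close(fd);
    if (n != (ssize_t)sizeof magic)
        return -1;
    return magic[0] == '#' && magic[1] == '!';
}
```

When the answer is 1, exec goes around again using the interpreter named on that first line as the real executable.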


Context Switching - Priorities


Our fundamental goal is to be running the most important 
process at any given time; for a typical process, its 
"importance" is determined by its recent CPU usage and 
its nice value. 


Every time the clock ticks (50 times/sec = every 20 ms for
the 300/400, 100 times/sec = every 10 ms for the 700),
the process that was running when the clock interrupted is
charged with a "tick" of CPU time (i.e. its p_cpu gets
incremented).


The system keeps a rough count of the number of processes that 
are either runnable or will/could be very soon in an array 
called "avenrun"; this is often referred to as the "load average" 
and is what things like xload/top/uptime/monitor print. 


p_cpu is decayed once per second, and all process priorities are 
recalculated: 


- p_cpu = p_cpu*(2*load_ave)/(2*load_ave + 1) + nice_value
- p_usrpri = PUSER + p_cpu/4 + 2*nice_value
If the process has been rtprio()’ed, forget the 2nd part....


Process priorities are recalculated every second for all
processes on the system (via the two equations above), and
every four clock ticks for the current process.
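
The two equations can be tried out with a small worked example. This is a
sketch under stated assumptions, not the kernel's code: it uses plain
integer arithmetic, rounds the load average to a whole number, and picks an
arbitrary value for PUSER (the real one comes from the kernel headers).

```c
#include <assert.h>

/* ASSUMPTION: illustrative value only; the real PUSER is kernel-defined. */
#define PUSER_ASSUMED 50

/* Once-per-second decay: p_cpu*(2*load_ave)/(2*load_ave + 1) + nice_value */
int decay_p_cpu(int p_cpu, int load_ave, int nice_value)
{
    assert(load_ave >= 0);
    return p_cpu * (2 * load_ave) / (2 * load_ave + 1) + nice_value;
}

/* p_usrpri = PUSER + p_cpu/4 + 2*nice_value (larger number = less important) */
int user_priority(int p_cpu, int nice_value)
{
    return PUSER_ASSUMED + p_cpu / 4 + 2 * nice_value;
}
```

With a load average of 1 and a nice value of 0, a p_cpu of 60 decays to 40
after one second. The higher the load average, the closer the decay factor
gets to 1, so CPU usage is "remembered" longer on a busy system.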


When some process becomes more important than the current one, 
a context switch is requested. The switch won’t actually happen 
until we are ready to go back into user mode. 


A switch will automatically be requested every timeslice/HZ
of a second. Since timeslice is normally HZ/10, we will
default to requesting a switch every 1/10th of a second.
300/400: HZ = 50    700: HZ = 100
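
The timeslice arithmetic is easy to check by hand or in a few lines of C.
The function names below are hypothetical helpers for this example only.

```c
#include <assert.h>

/* Milliseconds between clock ticks for a given HZ. */
int tick_ms(int hz)
{
    assert(hz > 0);
    return 1000 / hz;
}

/* Milliseconds between automatically requested context switches. */
int switch_interval_ms(int timeslice, int hz)
{
    assert(hz > 0);
    return timeslice * 1000 / hz;
}
```

With timeslice = HZ/10 (5 on the 300/400, 10 on the 700), both machines
come out at 100 ms between requested switches even though their clock
ticks are 20 ms and 10 ms long.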


Context Switching - Mechanics


A context switch can only happen when
- the process blocks by calling sleep() (in the kernel);
- the process is about to return to user mode from kernel mode;
  this could be a return from an interrupt or exception
  handler or a system call.


Save current context into u area, which is mapped into the top of
the process’ address space on the 68K and quadrant 2 on PA systems.


Restore other process’ context from its u area. 


Resume execution. 


Context Switching - Being Nice :-)


- Before 9.0, the "nice value" was used in the equations for
calculating process priority and had a *small* influence 
on the swapper. It affected how much a process could use 
the CPU, but did not really affect how much of the system’s 
throughput a process could consume. 


- In 9.0, a process’ nice value will have more effect on 
how much it can do - the pager and swapper pay *much* more 
attention to the nice value than they used to. This can be 
used in positive and negative ways - to preserve interactive 
performance, one could negatively nice the X server and 
positively nice the chip simulator running in the background. 


- nice(1) is a command wrapper around nice(2), which will change
the nice value of the current process (must be root to improve
it :-)


- renice(1) uses setpriority(2), which allows an appropriate user
to change the priority of other processes (not just the current
one). Top (version 2.5 or greater) also uses this.
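
From a user program, the same calls look roughly like this. It is a minimal
sketch using the standard getpriority(2)/setpriority(2) interface on the
calling process only; the two helper names are made up for this example,
and the numeric range of the nice value differs between systems.

```c
#include <errno.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Read the nice value of the calling process. Returns 0 on success. */
int current_nice(int *out)
{
    int v;

    errno = 0;                      /* -1 is also a legal return value */
    v = getpriority(PRIO_PROCESS, 0);
    if (v == -1 && errno != 0)
        return -1;
    *out = v;
    return 0;
}

/*
 * Make the calling process "nicer" by delta. Only delta >= 0 is
 * accepted here, since lowering the nice value needs privilege.
 */
int be_nicer(int delta)
{
    int now;

    if (delta < 0 || current_nice(&now) != 0)
        return -1;
    return setpriority(PRIO_PROCESS, 0, now + delta);
}
```

Passing another process ID instead of 0 is what lets renice(1) and top
act on processes other than the current one, subject to permissions.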


Tunable Parameters


- maxfiles - default number of files a single process can open
      defaults to 60


- maxfiles_lim - number of files a process can open if it does
  a setrlimit(2) call
      defaults to 1024


- maxuprc - number of processes a single user (UID) can have
      setting it high allows a single user to take lots of the
      system’s resources
      setting it low can cause users to get angry


- nproc - maximum number of processes on the system at any given time
      used to size a static array, the proc table
      it is also used to size other kernel data structures
      that relate to the number of processes on the system


- timeslice - length of timeslice for round-robin CPU scheduling
      normally 100ms (timeslice of "5" on 68K, "10" on PA)
      setting it too low makes us spend more of our
      time switching, less of it working
      setting it too high means interactive response is bad
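
The relationship between maxfiles (the default, or "soft", limit) and
maxfiles_lim (the ceiling reachable through setrlimit(2)) can be seen from
a user program. The sketch below uses the standard getrlimit/setrlimit
calls with RLIMIT_NOFILE; that resource name is the POSIX one and is an
assumption as far as a given HP-UX release is concerned. The function name
is hypothetical.

```c
#include <sys/time.h>
#include <sys/resource.h>

/*
 * Raise the calling process's soft open-file limit to its hard limit.
 * On success returns 0 and stores the resulting limits in *result.
 */
int raise_open_file_limit(struct rlimit *result)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;      /* soft limit up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (getrlimit(RLIMIT_NOFILE, result) != 0)
        return -1;
    return 0;
}
```

A process that never makes this call stays at the soft limit, which is why
the two tunables are sized separately.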


Kernel Variables Of Interest 


- _nproc, _timeslice from above; both are integers


- proc - pointer to proc table; defined in proc.h


- u area - see getu.c




Summary 
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Process Management 


- A process is a running program, and consists of text, data,
and stack areas as well as a u area and proc-table entry. Most
processes also use shared libraries, and some use shared memory.


Context switching refers to the kernel’s efforts to be sure we 
are running the "right" process at any given time. Processes 
"lose" priority by using up CPU time, and the kernel sees if it 
should switch processes any time the CPU is going from kernel 
mode to user mode. 


Each process gets a slot in the "proc table", and this table is 
sized by "nproc" (a tunable parameter). This parameter is also 
used to size other things, so it is a good one to bump up if 
there are general resource problems on the system. 


The proc-table entry is unique in that it will never be swapped 
out for as long as the process exists. This is important, and 
has much to do with the next point.... 


To send a signal, all we do is set a bit in the proc-table entry 
of the receiving process, and (possibly) mark it runnable. 
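
Setting a bit in the proc-table entry can be pictured like this. The
structure is a hypothetical, pared-down stand-in (only the field names
p_sig and p_sigignore are borrowed from proc.h), and the rules the real
kernel uses to decide whether to wake the process are more involved than
shown here.

```c
#include <assert.h>

/* Hypothetical miniature of a proc-table entry. */
struct mini_proc {
    unsigned int p_sig;        /* signals pending, one bit per signal */
    unsigned int p_sigignore;  /* signals being ignored */
    int runnable;              /* stand-in for "mark it runnable" */
};

/* Bit used for signal number signo (signals are numbered from 1). */
#define SIG_BIT(signo) (1u << ((signo) - 1))

/* Post a signal: no copying of data, just a bit in the receiver's entry. */
void post_signal(struct mini_proc *p, int signo)
{
    assert(signo >= 1 && signo <= 32);
    if (p->p_sigignore & SIG_BIT(signo))
        return;                /* ignored signals are simply dropped */
    p->p_sig |= SIG_BIT(signo);
    p->runnable = 1;           /* simplification: always wake the target */
}
```

Because the sender only touches the proc-table entry, the receiver's
u area does not need to be in memory, which is the point made above about
the proc-table entry never being swapped out.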


process logical address space: 


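[Figure: side-by-side maps of the 4GB process logical address space.
Labels recoverable from the original: I/O; shared stuff; gateway pages;
u area and kernel stack (2 pages, 1 MB); user stack, growing down from
the top of stack; shared memory, shared libraries, and mmap files;
code (+ data if using EXEC_MAGIC); stack, u area, kernel stack,
shared-library data, and data; bss/data, growing up to the top of data;
text.]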
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/* 
 * @(#)proc.h: $Revision: 1.65.61.12 $ $Date: 92/06/29 10:44:30 $
 */

#ifdef __hp9000s300
#include <machine/pte.h>
#endif /* __hp9000s300 */

#ifdef __hp9000s800
#include <sys/fss.h>
#endif /* __hp9000s800 */

#include <sys/vas.h>
#include <sys/pregion.h>
#include <sys/time.h>
#include <sys/mman.h>

/* Values for vfork_state field in struct vforkinfo */

#define VFORK_INIT      0
#define VFORK_PARENT    1
#define VFORK_CHILDRUN  2
#define VFORK_CHILDEXIT 3
#define VFORK_BAD       4

/*
 * The following structure is used by vfork to hold state while a
 * vfork is in progress.
 */

struct vforkinfo {
        int vfork_state;
        struct proc *pprocp;
        struct proc *cprocp;
        unsigned long buffer_pages;
        unsigned long u_and_stack_len;
#ifdef __hp9000s300
        unsigned char *u_and_stack_addr;
#endif
#ifdef __hp9000s800
        unsigned long saved_rp_ptr;
        unsigned long saved_rp;
#endif
        unsigned char *u_and_stack_buf;
        struct vforkinfo *prev;
};

/*
 * One structure allocated per active
 * process. It contains all data needed
 * about the process while the
 * process may be swapped out.
 * Other per process data (user.h)
 * is swapped with the process.
 */
typedef struct proc {
        struct proc *p_link;    /* linked list of running processes */
        struct proc *p_rlink;
        u_char  p_usrpri;       /* user-priority based on p_cpu and p_nice */
        u_char  p_pri;          /* priority, lower numbers are higher pri */
        u_char  p_rtpri;        /* real time priority */
        char    p_cpu;          /* cpu usage for scheduling */
        char    p_stat;
        char    p_nice;         /* nice for cpu usage */
        char    p_cursig;
        int     p_sig;          /* signals pending to this process */
        int     p_sigmask;      /* current signal mask */
        int     p_sigignore;    /* signals being ignored */
        int     p_sigcatch;     /* signals being caught by user */
        int     p_flag;         /* see flag defines below */
        int     p_flag2;        /* more flags; see below */
        int     p_coreflags;    /* core file options; see core.h */
#ifdef _CLASSIC_ID_TYPES
        u_short p_filler_uid;
        u_short p_uid;          /* user id, used to direct tty signals */
#else
        uid_t   p_uid;          /* user id, used to direct tty signals */
#endif
#ifdef _CLASSIC_ID_TYPES
        u_short p_filler_suid;
        u_short p_suid;         /* set (effective) uid */
#else
        uid_t   p_suid;         /* set (effective) uid */
#endif
#ifdef _CLASSIC_ID_TYPES
        u_short p_filler_pgrp;
        short   p_pgrp;         /* name of process group leader */
#else
        gid_t   p_pgrp;         /* name of process group leader */
#endif
#ifdef _CLASSIC_ID_TYPES
        u_short p_filler_pid;
        short   p_pid;          /* unique process id */
#else
        pid_t   p_pid;          /* unique process id */
#endif
#ifdef _CLASSIC_ID_TYPES
        u_short p_filler_ppid;
        short   p_ppid;         /* process id of parent */
#else
        pid_t   p_ppid;         /* process id of parent */
#endif
        caddr_t p_wchan;        /* event process is awaiting */
        size_t  p_maxrss;       /* copy of u.u_limit[MAXRSS] */
        u_short p_cpticks;      /* ticks of cpu time */
        long    p_cptickstotal; /* total for life of process */
        float   p_pctcpu;       /* cpu for this process during p_time */
        short   p_idhash;       /* hashed based on p_pid for kill+exit+... */
        short   p_pgrphx;       /* pgrp hash index */
        short   p_uidhx;        /* uid hash index */
        short   p_fandx;        /* free/active proc structure index */

1, 


Oct 14 10:39 1992 


        short   p_pandx;        /* previous active proc structure index */
        struct proc *p_pptr;    /* pointer to process structure of parent */
        struct proc *p_cptr;    /* pointer to youngest living child */
        struct proc *p_osptr;   /* pointer to older sibling processes */
        struct proc *p_ysptr;   /* pointer to younger siblings */
        struct proc *p_dptr;    /* pointer to debugger, if not parent */
        vas_t   *p_vas;         /* Virtual address space for process */
        preg_t  *p_upreg;       /* Pointer to pregion containing U area */
        ushort  p_mpgneed;      /* number of memory pages needed */
        struct proc *p_mlink;   /* link list of processes */
                                /* sleeping on memwant or */
                                /* swapwant. */
        short   p_memresv;      /* # pages reserved by this proc */
        short   p_swpresv;      /* # pages reserved by swapper this proc */
        u_short p_xstat;        /* exit status */
        char    p_time;         /* resident time for scheduling */
        char    p_slptime;      /* time since last block */
        short   p_ndx;
        struct itimerval p_realtimer;
        sid_t   p_sid;          /* session ID */
        short   p_sidhx;        /* session ID hash index */
        short   p_idwrite;      /* process ident write flag for auditing */
        struct fss *p_fss;      /* fair share group pointer */
        struct dbipc *p_dbipcp; /* dbipc pointer */
        u_char  p_wakeup_pri;   /* priority when proc awakens on semaphore */
        u_char  p_reglocks;     /* num reglock()’s held (see vm_sched.c) */
                                /* VASSERTS in region.h know this is 1 byte */
        caddr_t p_filelock;     /* address of file lock region process is
                                   either blocked on or about to block on.*/
        /* Doubly linked list of processes sharing the same controlling tty.
         * Head of list is u.u_procp->p_ttyp->t_cttyhp.
         */
        struct proc *p_cttyfp;  /* forward ptr */
        struct proc *p_cttybp;  /* backward ptr */
        caddr_t p_dlchan;       /* Process deadlock channel */
        site_t  p_faddr;        /* Process forwarding address */

        /* Fields used by the pstat system call. */
        struct timeval p_utime, p_stime;
        dev_t   p_ttyd;
        time_t  p_start;
        struct tty *p_ttyp;     /* controlling tty pointer */
        int     p_wakeup_cnt;   /* generic counter, wakeup when goes to 0 */
#ifdef MP
#ifdef SYNC_SEMA_RECOVERY
        sema_t  *p_recover_sema; /* Semaphore to recover on exit from sleep */
#endif
        int     p_descnt;       /* proc desire age */
        int     p_desproc;      /* processor desired */
        int     p_mpflag;       /* mp flag */
        int     p_procnum;      /* Processor it ran on, just for user info */
#endif /* MP */
        struct proc *p_wait_list;  /* Forward link for wait list */
        struct proc *p_rwait_list; /* Backward link for wait list */

        struct sema *p_sleep_sema; /* semaphore process is blocked on */
        struct sema *p_sema;    /* alpha: head of per-process semaphore list */

        /* These fields have been moved from user.h because you can no
         * longer retrieve this information from a uarea which has been
         * swapped out.
         */
        int     p_maxof;        /* max number of open files allowed */
        struct vnode *p_cdir;   /* current directory */
        struct vnode *p_rdir;   /* root directory of current process */
        struct ofile_t **p_ofilep; /* pointers to file descriptor chunks
                                      to be allocated as needed. */
        struct vforkinfo *p_vforkbuf; /* Vfork state information pointer */
        struct msem_procinfo *p_msem_info; /* Pointer to msemaphore info struc

        /* All workstation specific fields */
#ifdef _WSIO
        /* support for dil interrupts */
        struct buf *p_dil_event_f; /* head of list of pending dil interrupts */
        struct buf *p_dil_event_l; /* tail of list of pending dil interrupts */
        struct pte *p_addr;     /* u-area kernel map address */
        struct ste *p_segptr;   /* physical segment table pointer */
        int     p_stackpages;   /* Number of private kernel stack pages */
        u_char  p_dil_signal;   /* which signal to use for DIL interrupts */
#endif /* _WSIO */

#ifdef __hp9000s300
/* Only the 300 uses these time fields in this manner */
#define p_uticks p_utime.tv_sec
#define p_sticks p_stime.tv_sec
#endif /* __hp9000s300 */

        /* All 800 specific fields */
#ifdef __hp9000s800
        u_short p_pindx;        /* index of this proc table entry */
# ifdef _WSIO
        caddr_t graf_ss;        /* graphics per-process (mostly coproc) data */
# endif
#endif /* __hp9000s800 */
} proc_t;

/* chain */
extern struct proc *pfind();
extern struct proc *proc, *procNPROC;   /* the proc table itself */
extern int nproc;

#ifdef __hp9000s800
#define NQS      160    /* 160 run queues = 128 RT + 32 TS */
#define NQSPEL   16     /* Number of run queues per whichqs element */
#define NQSPELLG 4      /* log2(NQSPEL) */
#define NQELS    (NQS/NQSPEL) /* 10 elements to hold bitmask(whichqs) */
#define TSQ      128    /* First time-sharing queue */

#define TSPRI_TO_RUNQ(pri) (TSQ + (((pri)-PTIMESHARE) >> 2))
#else /* not __hp9000s800 */
#define NQS      256    /* 256 run queues 128 RT + 128 TS */
#define NQSPEL   32     /* Number of run queues per whichqs element */
#define NQELS    (NQS/NQSPEL) /* 8 32-bit elements to hold bitmask (whichqs) */
#define TSPRI_TO_RUNQ(pri) (pri) /* Don’t use anywhere but schedcpu! */
#endif /* not __hp9000s800 */

struct prochd {
        struct proc *ph_link;   /* linked list of running processes */
        struct proc *ph_rlink;
};

extern struct prochd qs[NQS];
extern int whichqs[NQELS];      /* bit mask summarizing non-empty qs’s */
#endif /* KERNEL */

/* stat codes */
#define SSLEEP  1       /* awaiting an event */
#define SWAIT   2       /* (abandoned state) */
#define SRUN    3       /* running */
#define SIDL    4       /* intermediate state in process creation */
#define SZOMB   5       /* intermediate state in process termination */
#define SSTOP   6       /* process being traced */

/* flag codes (p_flag) */
#define SLOAD    0x00000001 /* in core */
#define SSYS     0x00000002 /* swapper or pager process */
#define SLOCK    0x00000004 /* process being swapped out */
#define STRC     0x00000008 /* process is being traced */
#define SWTED    0x00000010 /* another tracing flag */
#define SKEEP    0x00000040 /* another flag to prevent swap out */
#define SOMASK   0x00000080 /* restore old mask after taking signal */
#define SWEXIT   0x00000100 /* working on exiting */
#define SPHYSIO  0x00000200 /* doing physical i/o (bio.c) */
#define SVFORK   0x00000400 /* Vfork in process */
#define SSEQL    0x00000800 /* user warned of sequential vm behavior */
#define SUANOM   0x00001000 /* user warned of random vm behavior */
#define SOUSIG   0x00002000 /* using old signal mechanism */
#define SOWEUPC  0x00004000 /* owe process an addupc() call at next ast */
#define SSEL     0x00008000 /* selecting; wakeup/waiting danger */
#define SRTPROC  0x00010000 /* real time processes */
#define SSIGABL  0x00020000 /* signalable process */
#define SPRIV    0x00040000 /* compute privilege mask */
#define SPREEMPT 0x00080000 /* Preemption flag */
#ifdef HPNSE
#define SPOLL    0x00100000 /* process is polling */
#endif

#ifdef _WSIO
/* more p_flag bits, used for process deactivation */
#define SSTOPFAULTING 0x00200000
#define SSWAPPED      0x00400000
#define SFAULTING     0x00800000

/* used to track number of faulting processes (not a p_flag bit) */
#define FAULTCNTPERPROC 8 
#endif /* _WSIO */ 


/* flags for p_flag2 */ 


#define S2CLDSTOP       0x00000001 /* send SIGCLD for stopped processes */
#define S2EXEC          0x00000002 /* if bit set, process has completed
                                      an exec(OS) call */
#define SGRAPHICS       0x00000004 /* The process is a graphics process */
#define SADOPTIVE       0x00000008 /* process adopted using ptrace */
#ifdef __hp9000s800
#define SSAVED          0x00000010 /* registers saved for ptrace */
#define SCHANGED        0x00000020 /* registers changed by ptrace */
#define SPURGE_SIDS     0x00000100 /* purge cr12 and cr13 in resume() */
#endif /* __hp9000s800 */
#ifdef __hp9000s300
#define S2DATA_WT       0x00000010 /* Process’s data segment is write thr
#define S2STACK_WT      0x00000020 /* Process’s stack segment is write th
#endif /* __hp9000s300 */
#define SANYPAGE        0x00000040 /* Doing any kind of paging */
#define SPA_ON          0x00000080 /* Under consideration for
                                      activation control */
#define S2POSIX_NO_TRUNC 0x00001000 /* no truncate flag for pathname lookup*/
#define POSIX_NO_TRUNC  S2POSIX_NO_TRUNC /* until dux_sdo.c is fixed */

#ifdef _WSIO
#define S2SENDDILSIG    0x00000200 /* whether to send DIL interrupt (cleared o
#endif /* _WSIO */
#define SLKDONE         0x00000400 /* Process has done lockf() or fcntl() */
#define SISNFSLM        0x00000800 /* Process is NFS lock manager. */
                                   /* See nfs_fcntl() in nfs_server.c */

#define S2TRANSIENT     0x00002000 /* transient flag (fair share scheduler) */

#ifdef MP
/* These are p_mpflag values */
#define SLPT            0x00000001 /* a Lower Priv Transfer trap brought
#define SRUNPROC        0x00000002 /* Running on a processor */
#define SMPLOCK         0x00000004 /* Locked */
#define SMP_SEMA_WAKE   0x00000008 /* proc awakened by V operation,
                                      not signal */
#define SMP_STOP        0x00000010 /* Process entering stopped state. */
#define SMP_SEMA_BLOCK  0x00000020 /* Process blocked on semaphore */
#define SMP_SEMA_NOSWAP 0x00000040 /* Do not swap this process */
#endif /* MP */

#ifdef __hp9000s300
#define PROCFLAGS2 (SADOPTIVE|S2EXEC|S2SENDDILSIG)
#endif
#ifdef __hp9000s800
#define PROCFLAGS2 (SADOPTIVE|S2EXEC|SCHANGED|SSAVED|S2TRANSIENT)
#endif


/* Constants which are used to call newproc */ 
#define FORK_PROCESS 1 
#define FORK_VFORK      2
#define FORK_DAEMON     3

/* Return values for newproc/procdup */
#define FORKRTN_PARENT  0
#define FORKRTN_CHILD   1
#define FORKRTN_ERROR   -1

/* Constants which can be used to index proc table for kernel daemons*/
#define S_SWAPPER       0
#define S_INIT          1
#define S_PAGEOUT       2
#define S_STAT          3
#define S_DONTCARE      -1

/* Constants which can be used for pid argument to newproc() */
/* Note: proc table slot and pid may be different for some processes */
#define PID_SWAPPER     0
#define PID_INIT        1
#define PID_PAGEOUT     2
#define PID_STAT        3
#define PID_LCSP        4
#define PID_NETISR      5
#define PID_SOCKREGD    6
#define PID_VDMAD       7
#define PID_MAXSYS      7


/* Used in dux/getpid.c */ 
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/* @(#) $Revision: 1.65.61.10 $ */ 


#include <machine/pcb.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/privgrp.h>
#include <errno.h>              /* u_error codes */
#include <sys/signal.h>         /* SIGARRAYSIZE */
#include <sys/proc.h>
#ifdef __hp9000s300
#include <a.out.h>
#endif /* __hp9000s300 */
#ifdef __hp9000s800
#include <sys/vmmac.h>
#include <machine/save_state.h>
#include <machine/som.h>
#endif /* __hp9000s800 */

/*
 * NFDCHUNKS = number of file descriptor chunks of size SFDCHUNK available
 * per process. SFDCHUNK must be NBTSPW = number of bits per int for
 * select to work.
 */
#define SFDCHUNK NBTSPW
#define NWORDS(n) ((((n) & (SFDCHUNK - 1)) == 0) ? (n >> 5) : \
                   ((n >> 5) + 1))
/* NWORDS is the number of words necessary for n file descriptors to allow
   for one bit per file descriptor. */

#define NFDCHUNKS(n) NWORDS(n)

/*
 * Some constants for fast multiplying, dividing, and mod-ing (%) by SFDCHUNK
 */
#define SFDMASK  0x1f
#define SFDSHIFT 5

struct ofile_t {
        struct file *ofile[SFDCHUNK];   /* file descriptor slots */
        char pofile[SFDCHUNK];          /* per process open file flags */
};

/*
 * since fuser() needs this information, we move it to the proc structure
 * since uareas can be swapped out. In previous releases, fuser() was
 * able to scan through the logical swap device to retrieve this information
 * however, that capability is no longer supported.
 */
#define u_maxof  u_procp->p_maxof   /* max # of open files allowed */
#define u_rdir   u_procp->p_rdir    /* root directory of current process */
#define u_cdir   u_procp->p_cdir    /* current directory */
#define u_ofilep u_procp->p_ofilep  /* pointers to file descriptor chunks
                                       to be allocated as needed. */

/*
 * maxfiles is maximum number of open files per process.

 * This is also the "soft limit" for the maximum number of open files per
 * process. maxfiles_lim is the "hard limit" for the maximum number of open
 * files per process.
 */
extern int maxfiles;
extern int maxfiles_lim;

#define LOCK_TRACK_MAX 10       /* for gfs lock tracking */

/*
 * Per process structure containing data that
 * isn’t needed in core when the process is swapped out.
 */

#define SHSIZE 32

typedef struct user {
#ifdef __hp9000s800
        struct pcb u_pcb;
#endif
        struct proc *u_procp;           /* pointer to proc structure */
#ifdef __hp9000s800
        struct save_state *u_sstatep;   /* pointer to a saved state */
#endif
#ifdef __hp9000s300
        int     *u_ar0;                 /* address of users saved R0 */
#endif /* __hp9000s300 */
        char    u_comm[MAXCOMLEN + 1];

        /* syscall parameters, results and catches */
        int     u_arg[10];              /* arguments to current system call */
        int     *u_ap;                  /* pointer to arglist */
        label_t u_qsave;                /* for non-local gotos on interrupts */
        u_short u_spare_short;          /* Replaces top half of u_error */
        u_short u_error;                /* return error code */

        union {                         /* syscall return values */
                struct {
                        int R_val1;
                        int R_val2;
                } u_rv;
#define r_val1 u_rv.R_val1
#define r_val2 u_rv.R_val2

/* Bell-to-Berkeley translations */
#define u_rval1 u_r.r_val1
#define u_rval2 u_r.r_val2

                off_t   r_off;
                time_t  r_time;
        } u_r;
        char    u_eosys;                /* special action on end of syscall */
        u_short u_syscall;              /* syscall # passed to signal handler */

        /* 1.1 - processes and protection */
        struct ucred *u_cred;           /* user credentials (uid, gid, etc) */
#define u_uid    u_cred->cr_uid
#define u_gid    u_cred->cr_gid
#define u_groups u_cred->cr_groups      /* groups, NOGROUP terminated */
#define u_ruid   u_cred->cr_ruid
#define u_rgid   u_cred->cr_rgid
        aid_t   u_aid;                  /* audit id */
        short   u_audproc;              /* audit process flag */
        short   u_audsusp;              /* audit suspend flag */
        struct audit_filename *u_audpath; /* ptr to audit pathname info */
        struct audit_string *u_audstr;  /* ptr to string data for auditing */
        struct audit_sock *u_audsock;   /* ptr to sockaddr data for auditing */
        char    *u_audxparam;           /* generic loc. to attach audit data */
#ifdef __hp9000s800
        u_int   u_spare1[5];            /* spares for backward compatibility */
#endif /* __hp9000s800 */
#ifdef _CLASSIC_ID_TYPES
        unsigned short u_filler_sgid;
        unsigned short u_sgid;          /* set (effective) gid */
#else
        gid_t   u_sgid;                 /* set (effective) gid */
#endif
        u_int   u_priv[PRIV_MASKSIZ];   /* privilege mask */

        /* 1.2 - memory management */
        label_t u_ssave;                /* label variable for swapping */
#ifdef __hp9000s800
        label_t u_psave;                /* trap recovery vector - machine dep
#endif /* __hp9000s800 */
#ifdef __hp9000s300
        label_t u_rsave;                /* for exchanging stacks */
        label_t u_psave;                /* for probe simulation */
#endif /* __hp9000s300 */
        time_t  u_outime;               /* user time at last sample */
        short   u_flag;                 /* See u_flag values */
#define UF_MEMSIGL 0x00000001           /* Signal upon memory allocation
                                         * and process locked
                                         */

        /* 1.3 - signal management */
        /* same for users and the kernel; see signal.h */
        void    (*u_signal[SIGARRAYSIZE])();    /* disposition of signals */
        int     u_sigmask[SIGARRAYSIZE];        /* signals to be blocked */
        int     u_sigonstack;           /* signals to take on sigstack */
        int     u_oldmask;              /* saved mask from before sigpause */
        int     u_code;                 /* ‘‘code’’ to trap */
        struct sigstack u_sigstack;     /* sp & on stack state variable */
#define u_onstack u_sigstack.ss_onstack
#define u_sigsp   u_sigstack.ss_sp
#ifdef __hp9000s800
        void    (*u_sigreturn)();       /* handler return address */
#define PA83_CONTEXT 0x1
#define PA89_CONTEXT 0x2
        int     u_sigcontexttype;       /* to tell PA83 from PA89 contexts */
#endif /* __hp9000s800 */
#ifdef __hp9000s300
        int     u_sigcode[6];           /* signal "trampoline" code */
#endif /* __hp9000s300 */


        int     u_sigreset;             /* reset handler after catching */
#ifdef __hp9000s300
        size_t  u_lockovh;              /* locked proc overhead size (clicks) */
                                        /* belongs with u_locksdsize */
#endif /* __hp9000s300 */

        /* 1.4 - descriptor management */
#define UF_EXCLOSE 0x1                  /* auto-close on exec */
#define UF_MAPPED  0x2                  /* mapped from device */
        int     u_highestfd;            /* highest file descriptor currently
                                           opened by this process. */
#ifdef _WSIO
        struct file *u_fp;              /* current file pointer */
#endif /* _WSIO */
#define UF_FDLOCK  0x4                  /* lockf was done,see vno_lockrelease */
        int     u_spare2[1];            /* spare */
#ifdef HPNSE
        dev_t   u_ttyd;                 /* controlling tty dev */
#endif
        short   u_cmask;                /* mask for file creation */

        /* 1.5 - timing and statistics */
        /* The user accumulated seconds and system accumulated seconds fields
         * of the following structure are maintained in the proc structure.
         * This should be taken into account in computations.
         */
        struct rusage u_ru;             /* stats for this proc */
        struct rusage u_cru;            /* sum of stats for reaped children */
        struct itimerval u_timer[3];
        int     u_XXX[2];
        time_t  u_ticks;
        short   u_acflag;

        /* 1.6 - resource controls */
        struct rlimit u_rlimit[RLIM_NLIMITS];

        /* BEGIN TRASH */
        char    u_segflg;               /* 0:user D; 1:system; 2:user I */
        caddr_t u_base;                 /* base address for IO */
        unsigned int u_count;           /* bytes remaining for IO */
        off_t   u_offset;               /* offset in file for IO */
#ifdef __hp9000s800
        /* The magic number, auxiliary SOM header and spares */
        struct {
                int u_magic;
                struct som_exec_auxhdr som_aux;
        } u_exdata;
#endif /* __hp9000s800 */
#ifdef __hp9000s300
        union {
                struct exec Ux_A;
                char ux_shell[SHSIZE];  /* #! and name of interpreter */
        } u_exdata;
#endif /* __hp9000s300 */
#ifdef __hp9000s800
        int     u_spare[9];
#define ux_mag    u_magic
#define ux_tsize  som_aux.exec_tsize
#define ux_dsize  som_aux.exec_dsize
#define ux_bsize  som_aux.exec_bsize
#define ux_entloc som_aux.exec_entry
#define ux_tloc   som_aux.exec_tfile
#define ux_dloc   som_aux.exec_dfile
#define ux_tmem   som_aux.exec_tmem
#define ux_dmem   som_aux.exec_dmem
#define ux_flags  som_aux.exec_flags
#define Z_EXEC_FLAG 0x1
#endif /* __hp9000s800 */
#ifdef __hp9000s300
#define ux_mag       Ux_A.a_magic.file_type
#define ux_system_id Ux_A.a_magic.system_id
#define ux_miscinfo  Ux_A.a_miscinfo
#define ux_tsize     Ux_A.a_text
#define ux_dsize     Ux_A.a_data
#define ux_bsize     Ux_A.a_bss
#define ux_entloc    Ux_A.a_entry
#endif /* __hp9000s300 */
        caddr_t u_dirp;                 /* pathname pointer */
        /* END TRASH */
        struct TrHeaderT *u_trptr;      /* QFS transaction header */
        int     u_lcount;               /* stack size of lock keys */
        int     u_ldebug;               /* for debug */
        int     u_lck_keys[LOCK_TRACK_MAX]; /* stack of lock keys */
        dev_t   u_devsused;             /* count of locked devices */
#ifdef __hp9000s800
        u_int   u_spare3[8];            /* spares for backward compatibility */
        int     u_sstep;                /* process single stepping flags */
#define ULINK   0x01f                   /* link register */
#define USSTEP  0x020                   /* process is single stepping */
#define UPCQM   0x040                   /* pc queue modified */
#define UBL     0x080                   /* branch and link at pcq head */
#define UBE     0x100                   /* branch external at pcq head */
        unsigned u_pcsq_head;           /* pc space and offset queue */
        unsigned u_pcoq_head;           /* values for single stepping */
        unsigned u_pcsq_tail;
        unsigned u_pcoq_tail;
        unsigned u_ipsw;                /* ipsw for single stepping */
        int     u_gr1;                  /* value for general register 1 */
        int     u_gr2;                  /* value for general register 2 */
#endif /* __hp9000s800 */
        struct uprof {                  /* profile arguments */
                short   *pr_base;       /* buffer base */
                unsigned pr_size;       /* buffer size */
                unsigned pr_off;        /* pc offset */
                unsigned pr_scale;      /* pc scaling */
        } u_prof;
#ifdef __hp9000s800
        u_int   u_kpreemptcnt;          /* kernel preemption counter: */
                                        /* read with GETKPREEMPTCNT() */
                                        /* clear with CLRKPREEMPTCNT() */
                                        /* incremented in kpreempt() */
#endif /* __hp9000s800 */
        dm_message u_request;           /* request message*/
        struct nsp *u_nsp;              /* nsp performing service*/
        site_t  u_site;                 /* site for which nsp executing */
        int     u_duxflags;             /* see defines below */
        char    **u_cntxp;              /* context pointer */
        struct locklist *u_prelock;     /* preallocated lock for lockadd() */

        struct ki_timeval u_syscall_time; /* system call timestamp */
        dev_t   u_devit;                /* device location of this process */
        ino_t   u_inode;                /* inode number of this process */
        int     *ki_clk_tos_ptr;

#define KI_CLK_STACK_SIZE 20
        int     ki_clk_stack[KI_CLK_STACK_SIZE];

        caddr_t u_vapor_mlist;          /* linked list of vapor_malloc mem */
        int     u_ord_blk;              /* last ordered write block */
#ifdef __hp9000s300
        struct pcb u_pcb;               /* should be last except u_stack */
#endif /* __hp9000s300 */

        union {                         /* double word aligned stack */
                double  s_dummy;
                int     s_stack[1];
        } u_s;                          /* must be last thing in user_t */
#define u_stack u_s.s_stack
} user_t;

/*
 * These two defines are moved (logically) from param.h. Need to have them
 * here to be able to get at sizeof (user_t)
 */
#ifdef __hp9000s800
#define KSTACKBYTES 8192                /* size of kernel stack */
#define UPAGES btorp(sizeof (user_t) + KSTACKBYTES)
#endif

struct ucred {
#ifdef _CLASSIC_ID_TYPES
        unsigned short cr_filler_uid;
        unsigned short cr_uid;          /* effective user id */
#else
        uid_t   cr_uid;                 /* effective user id */
#endif
#ifdef _CLASSIC_ID_TYPES
        unsigned short cr_filler_gid;
        unsigned short cr_gid;          /* effective group id */
#else
        gid_t   cr_gid;                 /* effective group id */
#endif 
#ifdef _CLASSIC_ID TYPES 

int cr_groups [NGROUPS] ; /* groups, 0 terminated */ 
#else 

gid_t cr_groups [NGROUPS] ; /* groups, 0 terminated */ 
#endif 


#ifdef _CLASSIC_ID_ TYPES 

unsigned short cr_filler_ruid; 

unsigned short cr_ruid; /* real user id */ 
#else 

uid_t  cr_ruid; /* real user id */ 
#endif 
#ifdef _CLASSIC_ID_TYPES 

unsigned short cr_filler_rgid; 


unsigned short cr_rgid; /* real group id */ 
#else 
gid t cr_rgid; /* real group id */ 
#endif 
short cr_ref; /* reference count */ 
i 
#ifdef _KERNEL 
#define crhold(cr) { SPINLOCK(cred_lock); (cr)->cr_ref++; SPINUNLOCK(cred_lo

struct ucred *crget();
struct ucred *crcopy();
struct ucred *crdup();
#endif /* _KERNEL */


/* u_eosys values */ 


#define EOSYS_NOTSYSCALL 0 /* not in kernel via syscall() */
#define EOSYS_NORMAL 1 /* in syscall but nothing notable */
#define EOSYS_INTERRUPTED 2 /* signal is not yet fully processed */
#define EOSYS_RESTART 3 /* user has requested restart */
#define EOSYS_NORESTART 4 /* user has requested error return */
#define RESTARTSYS EOSYS_INTERRUPTED /* temporary!!! */

/* 

* defines for u_duxflags 

*/ 
#define DUX_UNSP 4 /* process is a user NSP */ 

/* u_error codes */ 
#include <errno.h> /* Traditional */ 


#if defined(__hp9000s800) && defined(_KERNEL)

/* WARNING: NEVER, NEVER, NEVER use u as a local variable
 * name or as a structure element in I/O system or elsewhere in the
 * kernel.
 */

#define u (*uptr)

#define udot (*uptr)

#endif /* __hp9000s800 && _KERNEL */
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I/O Overview 


Memory mapped I/O
How I/O flows out of the system 
- uses the filesystem 


- uses the major number to go through the bdevsw/cdevsw 
tables to get to the appropriate driver 


- most of the work is done by the driver 
How I/O flows into the system 


- interrupt comes in from I/O card and is handled by 
the appropriate driver’s interrupt service routine 


- the driver may wake up sleeping processes, send out a
new command, or do whatever is appropriate


Device drivers 
- provide the window to interface to the outside world 
- provide the hardware specific routines 


- provide a common interface to the kernel 


I/O Performance 


SE 390: Series 300 HP-UX Internals


I/O Overview


How I/O Flows Out of the System


Background: we create a device file something like this: 
$ mknod /dev/tty03 c 1 0x0f0204 


This creates a special file for port #2 on a mux card, and
says that it is hardwired.


I/O to/from devices is handled using the same semantics as 
normal files in the file system. Because of this, programs can 
pretend that devices are just like regular files. However, the 
filesystem does not know anything about particular devices; it 
must use the relevant drivers to access them... 


All I/O starts with accessing the filesystem (during the open).
The "open" system call reads the device file's inode and keeps
the information for later use. The kernel will look at
the major number and type (char vs block) fields in the inode
to decide which driver to go through. It will also give the
driver a chance to do any necessary device dependent operations
(e.g. enable interrupts).


To get to the right driver, the filesystem will use the type
to choose a switch table (bdevsw or cdevsw), and the major
number as an index into the chosen table. The operation it
is performing (open, read, write, etc) tells it which element
of the struct to use once it is there.
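
The table lookup described above can be sketched in a few lines of C.
This is only an illustration, not the kernel's code: the struct layout,
the demo_* names, and the two-operation table are all assumptions made
up for the example. The major number picks the row, and the operation
picks the function pointer within that row.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified switch-table entry: one pointer per operation. */
struct demo_cdevsw {
    int (*d_open)(int minor);
    int (*d_read)(int minor);
};

/* Stand-in for "no driver configured at this major number". */
static int demo_nodev(int minor) { (void)minor; return -1; }

/* Pretend tty driver routines; return values are arbitrary markers. */
static int demo_tty_open(int minor) { return minor; }
static int demo_tty_read(int minor) { return minor + 100; }

/* The array index is the major number. */
static struct demo_cdevsw demo_table[] = {
    /* 0 */ { demo_nodev,    demo_nodev    },
    /* 1 */ { demo_tty_open, demo_tty_read },
};

enum demo_op { DEMO_OPEN, DEMO_READ };

/* Pick the row by major number, then the element by operation. */
int demo_dispatch(unsigned major, int minor, enum demo_op op)
{
    size_t n = sizeof(demo_table) / sizeof(demo_table[0]);
    const struct demo_cdevsw *sw;

    if (major >= n)
        return -1;                      /* major number out of range */
    sw = &demo_table[major];
    assert(sw->d_open != NULL && sw->d_read != NULL);
    return (op == DEMO_OPEN) ? sw->d_open(minor) : sw->d_read(minor);
}
```

With this table, demo_dispatch(1, 2, DEMO_READ) reaches demo_tty_read(),
while any call with major 0 lands in demo_nodev().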


I/O Structure Overview 


+---------------------------+
|       user process        |
+---------------------------+
+---------------------------+
|        filesystem         |    /-- bdevsw/cdevsw tables
+---------------------------+   /
-------------------------------/
+---------------------------+
|         top half          |    USER
|         of driver         |    CONTEXT
+---------------------------+
+---------------------------+
|        bottom half        |
|         of driver         |
+---------------------------+    INTERRUPT
+---------------------------+    CONTEXT
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Figure 4. HP 9000 Series 400 and 700 Memory Maps 
Design and Integration of Mixed-bus Systems on HP-UX Workstations
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I/O Overview 


- The PAS from 0x600000 to 0x800000 is "external I/O space", and is
where DIO-I cards are mapped. To figure out where a card will be
mapped, multiply its select code by 64K and add that to 0x600000.
The 64K starting at that address is available for the card to use.
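
As a worked example of that arithmetic, here is a small C sketch. The
function and macro names are hypothetical; only the constants (0x600000,
0x800000, 64K per select code) come from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define DIO_EXT_BASE  0x600000u   /* start of external I/O space */
#define DIO_EXT_END   0x800000u   /* end of external I/O space */
#define DIO_CARD_SIZE 0x10000u    /* 64K per select code */

/* Base address of the 64K window for a DIO-I card at this select code. */
uint32_t dio_card_base(unsigned select_code)
{
    uint32_t base = DIO_EXT_BASE + (uint32_t)select_code * DIO_CARD_SIZE;

    /* The whole window must fit inside external I/O space. */
    assert(base + DIO_CARD_SIZE <= DIO_EXT_END);
    return base;
}
```

For select code 14 (0x0e) this gives 0x600000 + 14 * 0x10000 = 0x6E0000.
Since the space is 2MB long, there is room for 32 such windows.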


I/O space is scanned at boot time to see what devices are present. 
The boot rom does some of this, and prints out the list of cards 
it finds. The kernel does it again, in preparation for doing I/O
later. Essentially all the kernel has to do is try to read from 
a particular address. If it gets a bus error, that means nothing 
is there. If it gets some data back, it will try to interpret 
that and figure out which card is there based on the value 
returned (the "ID byte" that cards are required to provide). 


When iomap(4) is used, it uses the minor number to calculate 
the appropriate address, and then calls System V shared memory 
routines to attach the user process’s virtual address space to 
the space for the card. 


Kernel Virtual Address Space            DIO-I External I/O Space

      4GB +---------------+        8 MB +---------------+
          |      RAM      |             |               |
          <               <             < The space     <
          >               >             > from 6MB-8MB  >
          <               <             < is split into <
          |               |             | 64K chunks,   |
512MB+8MB +---------------+             | one per       |
          | DIO-I         |             | select code   |
          | external      |             |               |
          | I/O space     |             |               |
512MB+6MB +---------------+        7 MB +---------------+
          | internal I/O  |             | SC 14 (0x0e)  |
          | space         |             |               |
   512 MB +---------------+             | The first 8   |
          <               >             | of these are  |
          >               <             | kind of       |
          <               >             | bogus; one    |
    32 MB +---------------+             | can't         |
          | big tables    |             | necessarily   |
          | bss           |             | use them :-)  |
          | data          |        6 MB +---------------+
          | kernel code   |
     0 MB +---------------+
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I/O Overview 


68K Interrupt Handling
- interrupt comes in from I/O card at card's IL
- IL indexes into _rupttable


- Each entry in _rupttable is the head of a linked
list of structures, one per card. They are in
increasing order by select code, and look something
like this:


+---------------+-----------------+----------+
| register addr | value to expect | ISR addr |
+---------------+-----------------+----------+


- The kernel's interrupt handler walks the list, asking
each card if it was the one that interrupted. This is
done by reading a register on the card and comparing
the value with what the driver said would be there if
the card interrupted.


- When the right card is identified, its device driver is
called to process the interrupt (sending out a new command,
grabbing the data off of the card, etc).
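
The list walk described above might look roughly like the following C
sketch. Everything here is an assumption for illustration: the struct
and function names are invented, the card registers are simulated with
ordinary variables, and the comparison is a plain equality test (the
real handler may well mask the value first).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-card entry, modeled on the three fields in the text. */
struct demo_rupt {
    volatile unsigned char *reg;   /* register addr */
    unsigned char expect;          /* value to expect if card interrupted */
    void (*isr)(int select_code);  /* ISR addr */
    int select_code;
    struct demo_rupt *next;
};

#define DEMO_LEVELS 8
static struct demo_rupt *demo_rupttable[DEMO_LEVELS];
static int demo_last_serviced = -1;

static void demo_isr(int select_code) { demo_last_serviced = select_code; }

/* Link an entry into its level's list, keeping select codes increasing. */
void demo_link(int level, struct demo_rupt *e)
{
    struct demo_rupt **pp;

    assert(level >= 0 && level < DEMO_LEVELS);
    pp = &demo_rupttable[level];
    while (*pp != NULL && (*pp)->select_code < e->select_code)
        pp = &(*pp)->next;
    e->next = *pp;
    *pp = e;
}

/* Walk the list for this level; call the ISR of the first card that
 * shows the expected value. Returns its select code, or -1 if none. */
int demo_interrupt(int level)
{
    struct demo_rupt *e;

    assert(level >= 0 && level < DEMO_LEVELS);
    for (e = demo_rupttable[level]; e != NULL; e = e->next) {
        if (*e->reg == e->expect) {
            e->isr(e->select_code);
            return e->select_code;
        }
    }
    return -1;
}

/* Two simulated cards at level 3; only select code 9 is interrupting. */
int demo_run(void)
{
    volatile unsigned char reg7 = 0x00, reg9 = 0x80;
    struct demo_rupt card7 = { &reg7, 0x80, demo_isr, 7, NULL };
    struct demo_rupt card9 = { &reg9, 0x80, demo_isr, 9, NULL };
    int serviced;

    demo_rupttable[3] = NULL;
    demo_link(3, &card9);          /* linked out of order on purpose */
    demo_link(3, &card7);
    serviced = demo_interrupt(3);
    demo_rupttable[3] = NULL;      /* entries live on this stack frame */
    return serviced;
}
```

In demo_run() the handler polls select code 7 first (lower select code),
finds no match, and then services select code 9.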


                        _rupttable


                  low select code             high select code
Interrupt level   +------------+    +------------+    +------------+
       1          |            |--->|            |--->|            |--->...
                  +------------+    +------------+    +------------+
                  +------------+
       2          |            |--->...
                  +------------+
                  +------------+    +------------+
       3          |            |--->|            |--->...
                  +------------+    +------------+
                  +------------+    +------------+    +------------+
       4          |            |--->|            |--->|            |--->...
                  +------------+    +------------+    +------------+
                  +------------+    +------------+    +------------+
       5          |            |--->|            |--->|            |--->...
                  +------------+    +------------+    +------------+
                  +------------+
       6          |            |--->...
                  +------------+
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I/O Overview 


Series 700 Specifics


- The top 256MB of physical address space is where PA-RISC thinks
I/O space should be. Some interesting pieces of this space:

0xf0820000 --> 0xf0ffffff Core I/O (LAN, SCSI, HIL, etc)
0xf4000000 --> 0xf7ffffff SGC slot 1
0xf8000000 --> 0xfbffffff SGC slot 2 (720 uses this one)
0xfc000000 --> 0xffbfffff EISA


When an interface needs to interrupt, its bit in a dedicated 
register is set, and the CPU will notice this; note that 
there is no need to *figure out* who interrupted since each 
interface has a dedicated bit. 


Devices have no settable "interrupt priorities"; it is up to 
the software to decide what to service first. Here’s the 
order the software uses as of 8.05: 


bus errors (shouldn’t happen) 

EISA 

graphics (doesn’t often happen) 

SCSI 

LAN 

parallel 

serial

HIL (people are slow peripherals :-) 
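

A minimal sketch of "one dedicated bit per interface, software decides
the order" follows. The bit positions and all names are hypothetical;
only the service order is taken from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments; the real ones are fixed by the hardware. */
#define DEMO_IRQ_HIL      (1u << 0)
#define DEMO_IRQ_SERIAL   (1u << 1)
#define DEMO_IRQ_PARALLEL (1u << 2)
#define DEMO_IRQ_LAN      (1u << 3)
#define DEMO_IRQ_SCSI     (1u << 4)
#define DEMO_IRQ_GRAPHICS (1u << 5)
#define DEMO_IRQ_EISA     (1u << 6)
#define DEMO_IRQ_BUSERR   (1u << 7)

/* Service order from the text (as of 8.05), first to last. */
static const uint32_t demo_order[] = {
    DEMO_IRQ_BUSERR, DEMO_IRQ_EISA, DEMO_IRQ_GRAPHICS, DEMO_IRQ_SCSI,
    DEMO_IRQ_LAN, DEMO_IRQ_PARALLEL, DEMO_IRQ_SERIAL, DEMO_IRQ_HIL,
};

/* Given the pending-interrupt register, return the bit to service next,
 * or 0 if nothing is pending. No polling of cards is needed. */
uint32_t demo_next_irq(uint32_t pending)
{
    size_t i, n = sizeof(demo_order) / sizeof(demo_order[0]);

    for (i = 0; i < n; i++) {
        uint32_t bit = demo_order[i];

        assert((bit & (bit - 1)) == 0);   /* exactly one bit per source */
        if (pending & bit)
            return bit;
    }
    return 0;
}
```

If SCSI and HIL are both pending, demo_next_irq() picks SCSI first, in
line with the order above.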


The cards/adapters tend to have "smart DMA" on them:


- SCSI uses an NCR chip that has a script processor;
this maximizes disk throughput and minimizes the
need for CPU intervention because the driver can
build a whole chain of commands and then point
the script processor at them


- The LAN interface has a 128-byte inbound buffer and a 64-byte
outbound one. Each of the 2 RS232s has a 16-byte buffer
for inbound and another for outbound traffic.


- EISA converter is basically a window between EISA
cards and the rest of the box
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I/O Overview 
Types of Drivers 


- block mode 


- usually associated with the filesystem, and deals with
blocks of data of the same size


- used with random access devices 

- almost *always* use DMA 

- shields user from hardware details (like disk sector 
size; a disk doesn’t want any requests that aren’t 
a multiple of its hardware sector size) 

- character mode 
- usually sequential devices (e.g. printers, tapes) 
- deals with "variable" lengths of data 


- character mode does not mean it deals only 
with "characters" 


- may use DMA transfers, or may be solely CPU 
(interrupt) transfers 


- may be *very* similar to block-mode driver (e.g. 
"raw" and "block" CS80 share about 90% of their code) 


- Device drivers don’t have to have hardware associated 
with them; they are a general mechanism for extending 
the kernel. 


How Is A Driver Configured? 


Note: the config(1M) and master(4) manpages are good references.


/etc/master contains the information on drivers. There are two
types of "driver" entry: the upper-level (device) drivers
(e.g. cs80, tty, etc) and the lower-level (interface or
card) drivers (e.g. parallel). Some drivers may combine both,
as in the SCSI driver.


The driver information in /etc/master tells "config" what entries 
to put in the conf.c file (which will in turn make the linker 
do most of the work). Here are some lines from /etc/master: 


* name    handle  type  mask  block  char
*
cs80      cs80    3     3FB   0      4
tape      tp      1     FA    -1     5
ramdisc   ram     3     FB    4      20
98624     ti9914  10    100   -1     -1
98625     simon   10    100   -1     -1
98628     sio628  10    100   -1     -1
98642     sio642  10    100   -1     -1
* 14
tty       sy      D     FD    -1     2


The fields are:


name - the name used in the "dfile" for this driver
handle - the "handle" actually used in the kernel (e.g. the
tty driver's open routine is sy_open)
type - 5-bit attribute flag indicating "type" of driver:
    4 3 2 1 0
    | | | | \--- character device
    | | | \----- block device
    | | \------- required driver
    | \--------- specified only once
    \----------- card


mask - 10-bit driver routine flag; tells config what routines to
include in conf.c for the driver
    9 8 7 6 5 4 3 2 1 0
    | | | | | | | | | \--- C_ALLCLOSES flag
    | | | | | | | | \----- seltrue handler (select is always TRUE)
    | | | | | | | \------- select handler
    | | | | | | \--------- ioctl handler
    | | | | | \----------- write handler
    | | | | \------------- read handler
    | | | \--------------- close handler
    | | \----------------- open handler
    | \------------------- link routine (links interrupt handler;
    |                      found in all interface drivers)
    \--------------------- size handler (in disc-type drivers)
block - major number for block device driver
char - major number for character device driver
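

To make the mask field concrete, here is a small decoder written as a
sketch. It assumes the mask column is hexadecimal (which fits the sample
values 3FB, FA and FD above); the function name and the table are made
up for this example and are not part of config.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bit-to-routine mapping taken from the mask description above. */
static const struct {
    unsigned bit;
    const char *name;
} demo_mask_bits[] = {
    { 0x200, "size" },   { 0x100, "link" },    { 0x080, "open" },
    { 0x040, "close" },  { 0x020, "read" },    { 0x010, "write" },
    { 0x008, "ioctl" },  { 0x004, "select" },  { 0x002, "seltrue" },
    { 0x001, "C_ALLCLOSES" },
};

/* Return a space-separated list of what a mask value asks for.
 * The result lives in a static buffer and is overwritten on each call. */
const char *demo_mask_names(unsigned mask)
{
    static char buf[128];
    size_t i, n = sizeof(demo_mask_bits) / sizeof(demo_mask_bits[0]);

    assert((mask & ~0x3FFu) == 0);      /* only 10 bits are defined */
    buf[0] = '\0';
    for (i = 0; i < n; i++) {
        if (mask & demo_mask_bits[i].bit) {
            if (buf[0] != '\0')
                strcat(buf, " ");
            strcat(buf, demo_mask_bits[i].name);
        }
    }
    return buf;
}
```

Decoding the tty entry's FD this way yields open, close, read, write,
ioctl, select and C_ALLCLOSES, which matches the sy_* routines that
config brings in for it.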


The major (or driver) number indicates the array offset for the 
routine entries in a device switch table. 




Examples from conf.c for the routines "brought in" by the "type" 
and "mask" values above are as follows: 


extern cs80_open(), cs80_close(), cs80_read(), cs80_write(),
       cs80_ioctl(), cs80_size(), cs80_link(), cs80_strategy();

extern sy_open(), sy_close(), sy_read(), sy_write(), sy_ioctl(),
       sy_select();


extern ti9914_link();


Following are excerpts from the bdev/cdev switch tables. It is
via these two tables that the proper subroutine calls are made
for the appropriate driver. By modifying /etc/master's driver
numbers, you can change the "major" numbers :-)


struct bdevsw bdevsw[] = {
/* 0*/ cs80_open, cs80_close, cs80_strategy, cs80_size, C_ALLCLO
/* 1*/ nodev, nodev, nodev, nodev, 0,


};


struct cdevsw cdevsw[] = {


/* 2*/ sy_open, sy_close, sy_read, sy_write, sy_ioctl, sy_select,
       C_ALLCLOSES,


/* 4*/ cs80_open, cs80_close, cs80_read, cs80_write, cs80_ioctl,
       seltrue, C_ALLCLOSES,


/*43*/ nodev, nodev, nodev, nodev, nodev, nodev, 0,


};


This structure (driver_link[], listed below) is used during startup
to allow for linking of the "make_entry" routines for the drivers.


The make_entry() routine for each driver is called during startup
of the system. For each card found during bootup, the kernel
calls the make_entry() routines. Each routine checks to see if
the card is its own. If so, it may perform some initialization,
and it reports finding the card. If not, it calls the next
driver's make_entry(). There is always a dummy routine at the
end of the list that reports that no driver was found for the card.


int (*driver_link[])() =
{
        cs80_link,
        amigo_link,
        scsi_link,
        graphics_link,
        ptys_link,


        sio628_link,
        sio642_link,
        ite200_link,
        (int (*)())0
};
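
The make_entry() chain is essentially a "claim it or pass it on" list.
The sketch below only illustrates that idea: the names, the card ID
values, and the use of recursion to stand for "call the next driver's
make_entry()" are all assumptions, not the kernel's actual mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver record: a name and the card ID it claims. */
struct demo_driver {
    const char *name;
    int card_id;        /* made-up ID values, for illustration only */
};

static const struct demo_driver demo_drivers[] = {
    { "cs80",   1 },
    { "scsi",   7 },
    { "sio642", 5 },
};

/* Driver number idx looks at the card. If the card is its own, it
 * claims it; otherwise it hands the card to the next driver. The entry
 * past the end plays the dummy routine that reports "no driver". */
const char *demo_make_entry(size_t idx, int card_id)
{
    size_t n = sizeof(demo_drivers) / sizeof(demo_drivers[0]);

    assert(idx <= n);
    if (idx == n)
        return "no driver";                     /* dummy at end of list */
    if (demo_drivers[idx].card_id == card_id)
        return demo_drivers[idx].name;          /* this card is ours */
    return demo_make_entry(idx + 1, card_id);   /* pass it along */
}
```

Calling demo_make_entry(0, 7) walks past cs80 and is claimed by scsi; an
unknown ID falls through to the dummy entry.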
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*dskless 
nipc
netman
ni
inet
lla
lan01
cs80
scsi
scsitape
tape
stape
printer
ptymas
ptyslv
hpib
98624
98625
98626
98628
98642
uipc

nbuf 1024
nproc 256
ninode 1000
nfile 1000
swap auto
swap scsi f0500 -1
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/* 


 * Configuration information
 */

#define MAXUSERS 8
#define TIMEZONE 420
#define DST 1
#define NPROC 256
#define NUM_CNODES ((5*SERVER_NODE)+DSKLESS_NODE)
#define DSKLESS_NODE 0
#define SERVER_NODE 0
#define NINODE 1000
#define NFILE 1000
#define FILE_PAD 10
#define MAXFILES 60
#define MAXFILES_LIM 1024
#define NBUF 1024
#define FS_ASYNC 0
#define DOS_MEM_BYTE 0
#define NCALLOUT (16+NPROC+USING_ARRAY_SIZE+SERVING_ARRAY_SIZE)
#define UNLOCKABLE_MEM 102400
#define NFLOCKS 200
#define NPTY 82
#define MAXUPRC 50
#define MAXDSIZ 0x01000000
#define MAXSSIZ 0x00200000
#define MAXTSIZ 0x01000000
#define PARITY_OPTION 2
#define REBOOT_OPTION 1
#define TIMESLICE 0
#define ACCTSUSPEND 2
#define ACCTRESUME 4
#define NDILBUFFERS 30
#define FILESIZELIMIT 0x1fffffff
#define USING_ARRAY_SIZE (NPROC)
#define SERVING_ARRAY_SIZE (SERVER_NODE*NUM_CNODES*MAXUSERS+2*MAXUSERS)
#define DSKLESS_FSBUFS (SERVING_ARRAY_SIZE)
#define SELFTEST_PERIOD 120
#define INDIRECT_PTES 1
int indirect_ptes = INDIRECT_PTES;
#define CHECK_ALIVE_PERIOD 4
#define RETRY_ALIVE_PERIOD 21
#define MAXSWAPCHUNKS 512
#define MINSWAPCHUNKS 4
#define NSWAPDEV 10
#define NSWAPFS 10
#define NUM_LAN_CARDS 2
#define NETISR_PRIORITY -1
#define NGCSP (8*NUM_CNODES)
#define NNI 1
#define SCROLL_LINES 100
#define NUM_PDNO =
#define MESG 1
#define MSGMAP (MSGTQL+2)
#define MSGMAX 8192
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#define MSGMNB 16384 

#define MSGMNI 50 

#define MSGSSZ 1 

#define MSGTQL 40 

#define MSGSEG 16384 

#define SEMA 1 

#define SEMMAP (SEMMNI+2) 

#define SEMMNI 64 

#define SEMMNS 128 

#define SEMMNU 30 

#define SEMUME 10 

#define SEMVMX 32767 

#define SEMAEM 16384 

#define SHMEM 1 

#define SHMMAX 0x00600000 

#define SHMMIN 1 

#define SHMMNI 30 

#define SHMSEG 10 

#define FPA 1 

#define SWAPMEM_ON 0

#define SWCHUNK 2048 

#define UIPC 

#define UIPC 

#define NIPC 

#define INET 

#define INET 

#define NI 

#define LANO1 

#include "/etc/conf/h/param.h"
#include "/etc/conf/h/systm.h"
#include "/etc/conf/h/tty.h"
#include "/etc/conf/h/space.h"
#include "/etc/conf/h/opt.h"
#include "/etc/conf/h/conf.h"
#define ieee802_open lan_open
#define ieee802_close lan_close
#define ieee802_read lan_read
#define ieee802_write lan_write
#define ieee802_link lan_link
#define ieee802_select lan_select
#define ethernet_open lan_open
#define ethernet_close lan_close
#define ethernet_read lan_read
#define ethernet_write lan_write
#define ethernet_link lan_link
#define ethernet_select lan_select
#define hpib_link gpio_link
#define lla_link lan_link
#define lan01_link lan_link
extern nodev(), nulldev();

extern seltrue(), notty();
extern cs80_open(), cs80_close(), cs80_read(), cs80_write(), cs80_ioctl(), cs8
extern swap_strategy();
extern swapl_strategy();
extern scsi_open(), scsi_close(), scsi_read(), scsi_write(), scsi_ioctl(), scs
extern cons_open(), cons_close(), cons_read(), cons_write(), cons_ioctl(), con
extern tty_open(), tty_close(), tty_read(), tty_write(), tty_ioctl(), tty_sele
extern sy_open(), sy_close(), sy_read(), sy_write(), sy_ioctl(), sy_select();
extern mm_read(), mm_write();
extern tp_open(), tp_close(), tp_read(), tp_write(), tp_ioctl();
extern lp_open(), lp_close(), lp_write(), lp_ioctl();
extern swap_read(), swap_write();
extern stp_open(), stp_close(), stp_read(), stp_write(), stp_ioctl();
extern iomap_open(), iomap_close(), iomap_read(), iomap_write(), iomap_ioctl()
extern graphics_open(), graphics_close(), graphics_ioctl(), graphics_link();
extern ptym_open(), ptym_close(), ptym_read(), ptym_write(), ptym_ioctl(), pty
extern ptys_open(), ptys_close(), ptys_read(), ptys_write(), ptys_ioctl(), pty
extern lla_open(), lla_link();
extern lla_open();
extern hpib_open(), hpib_close(), hpib_read(), hpib_write(), hpib_ioctl();
extern r8042_open(), r8042_close(), r8042_ioctl();
extern hil_open(), hil_close(), hil_read(), hil_ioctl(), hil_select(), hil_lin
extern nimitz_open(), nimitz_close(), nimitz_read(), nimitz_select();
extern scsitape_open(), scsitape_close(), scsitape_read(), scsitape_write(), s
extern ni_open(), ni_close(), ni_read(), ni_write(), ni_ioctl(), ni_select(),
extern audio_open(), audio_close(), audio_read(), audio_write(), audio_ioctl()
extern nm_open(), nm_close(), nm_read(), nm_ioctl(), nm_select();

extern nipc_link();
extern inet_link();
extern uipc_link();
extern scsi_if_link();
extern ti9914_link();
extern simon_link();
extern sio626_link();
extern sio628_link();
extern sio642_link();
extern ite200_link();

struct bdevsw bdevsw[] = {
/* 0*/ {cs80_open, cs80_close, cs80_strategy, cs80_dump, cs80_size, C_ALLCLOS
/* 1*/ {nodev, nodev, nodev, nodev, nodev, 0, nodev},
/* 2*/ {nodev, nodev, nodev, nodev, nodev, 0, nodev},
/* 3*/ {nodev, nodev, swap_strategy, nodev, 0, 0, nodev},
/* 4*/ {nodev, nodev, nodev, nodev, nodev, 0, nodev},
/* 5*/ {nodev, nodev, swapl_strategy, nodev, 0, 0, nodev},
/* 6*/ {nodev, nodev, nodev, nodev, nodev, 0, nodev},
/* 7*/ {scsi_open, scsi_close, scsi_strategy, scsi_dump, scsi_size, C_ALLCLOS
};

struct cdevsw cdevsw[] = {
/* 0*/ {cons_open, cons_close, cons_read, cons_write, cons_ioctl, cons_select
/* 1*/ {tty_open, tty_close, tty_read, tty_write, tty_ioctl, tty_select, C_AL
/* 2*/ {sy_open, sy_close, sy_read, sy_write, sy_ioctl, sy_select, C_ALLCLOSE
/* 3*/ {nulldev, nulldev, mm_read, mm_write, notty, seltrue, 0},
/* 4*/ {cs80_open, cs80_close, cs80_read, cs80_write, cs80_ioctl, seltrue, C_
/* 5*/ {tp_open, tp_close, tp_read, tp_write, tp_ioctl, seltrue, 0},
/* 6*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/* 7*/ {lp_open, lp_close, nodev, lp_write, lp_ioctl, seltrue, 0},
/* 8*/ {nulldev, nulldev, swap_read, swap_write, notty, nodev, 0},
/* 9*/ {stp_open, stp_close, stp_read, stp_write, stp_ioctl, seltrue, 0},
/*10*/ {iomap_open, iomap_close, iomap_read, iomap_write, iomap_ioctl, nodev,
/*11*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*12*/ {graphics_open, graphics_close, nodev, nodev, graphics_ioctl, nodev, C
/*13*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*14*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*15*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*16*/ {ptym_open, ptym_close, ptym_read, ptym_write, ptym_ioctl, ptym_select
/*17*/ {ptys_open, ptys_close, ptys_read, ptys_write, ptys_ioctl, ptys_select
/*18*/ {lla_open, nulldev, nodev, nodev, notty, nodev, C_ALLCLOSES},
/*19*/ {lla_open, nulldev, nodev, nodev, notty, nodev, C_ALLCLOSES},
/*20*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*21*/ {hpib_open, hpib_close, hpib_read, hpib_write, hpib_ioctl, seltrue, C_
/*22*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*23*/ {r8042_open, r8042_close, nodev, nodev, r8042_ioctl, nodev, 0},
/*24*/ {hil_open, hil_close, hil_read, nodev, hil_ioctl, hil_select, 0},
/*25*/ {nimitz_open, nimitz_close, nimitz_read, nodev, notty, nimitz_select,
/*26*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*27*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*28*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*29*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*30*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*31*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*32*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*33*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*34*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*35*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*36*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*37*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*38*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*39*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*40*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*41*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*42*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*43*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*44*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*45*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*46*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*47*/ {scsi_open, scsi_close, scsi_read, scsi_write, scsi_ioctl, seltrue, C_
/*48*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*49*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*50*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*51*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*52*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*53*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*54*/ {scsitape_open, scsitape_close, scsitape_read, scsitape_write, scsitap
/*55*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*56*/ {ni_open, ni_close, ni_read, ni_write, ni_ioctl, ni_select, 0},
/*57*/ {audio_open, audio_close, audio_read, audio_write, audio_ioctl, audio_
/*58*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*59*/ {nodev, nodev, nodev, nodev, nodev, nodev, 0},
/*60*/ {nm_open, nm_close, nm_read, nodev, nm_ioctl, nm_select, 0},
};
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int nblkdev = sizeof (bdevsw) / sizeof (bdevsw[0]); 
int nchrdev = sizeof (cdevsw) / sizeof (cdevsw[0]);
dev_t rootdev = makedev(-1, 0xFFFFFF);


/* The following three variables are dependent upon bdevsw and cdevsw. If
   either changes then these variables must be checked for correctness */


dev_t swapdevl = makedev(5, 0x000000);


int brmtdev = 6;
int crmtdev = 45;
struct swdevt swdevt[] = {
        { SWDEF, 0, -1, 0 },
        { makedev(7, 0x0f0500), 0, -1, 0 },
        { NODEV, 0, 0, 0 },
        { NODEV, 0, 0, 0 },
        { NODEV, 0, 0, 0 },
        { NODEV, 0, 0, 0 },
        { NODEV, 0, 0, 0 },
        { NODEV, 0, 0, 0 },
        { NODEV, 0, 0, 0 },
        { NODEV, 0, 0, 0 },
};

dev_t dumpdev = makedev(-1, 0xFFFFFF);
int (*driver_link[])() =
{
        cs80_link,
        scsi_link,
        graphics_link,
        ptys_link,
        lla_link,
        hil_link,
        ni_link,
        audio_link,
        nipc_link,
        inet_link,
        uipc_link,
        scsi_if_link,
        ti9914_link,
        simon_link,
        sio626_link,
        sio628_link,
        sio642_link,
        ite200_link,
        (int (*)())0
};

char dfile_data[] = "\
nipc\n\
netman\n\
ni\n\
inet\n\
lla\n\
lan01\n\
cs80\n\ 
scsi\n\ 
scsitape\n\ 
tape\n\ 
stape\n\ 
printer\n\ 
ptymas\n\ 
ptyslv\n\ 
hpib\n\ 
98624\n\ 
98625\n\ 
98626\n\ 
98628\n\ 
98642\n\ 
uipc\n\ 

nbuf 1024\n\ 
nproc 256\n\ 
ninode 1000\n\ 
nfile 1000\n\ 
swap auto\n\ 
swap scsi f0500 -1\n\
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## 

## HP-UX System Makefile
##

# .SILENT


STDDEFS=-Dhp9000s200 -D__hp9000s200 -D__hp9000s300 -Dhpux -D HPUX SO
IDENT=-D_KERNEL -DKERNEL -Uvax -DHFS -DMC68030 -DPSTAT -DSAVECORE_30
	-DREGION -DKVM -DGENESIS -DAUTOCHANGER -DEISA -DWRITE GUARD_
REALTIME = -DRTPRIO -DPROCESSLOCK -DEISA


CC = /bin/cc
AS = /bin/as
LD = /bin/ld


SHELL = /bin/sh
ROOT = /etc/conf


LIBS = \
	$(ROOT)/libuipc.a \
	$(ROOT)/libnipc.a \
	$(ROOT)/liblan.a \
	$(ROOT)/libinet.a \
	$(ROOT)/libnet.a \
	$(ROOT)/libkreq.a \
	$(ROOT)/libdreq.a \
	$(ROOT)/libpm.a \
	$(ROOT)/libvm.a \
	$(ROOT)/libsysV.a \
	$(ROOT)/libmin.a \
	$(ROOT)/libdevelop.a \
	$(ROOT)/libdil_srm.a \
	$(ROOT)/libkern.a \
	$(ROOT)/libk.a


CFLAGS= +M -Wc,-Nd3500,-Ns7000 -Wp,-H250000 -I.
COPTS= $(STDDEFS) $(IDENT) $(REALTIME)

KREQ1_OBJS= exceptions.o locore.o vers.o

KREQ2_OBJS= name.o funcentry.o cdfs_hooks.o
DEBUG_OBJS= debug.nms.o


all: hp-ux


hp-ux: conf.o
	rm -f hp-ux
	ar x $(ROOT)/libkreq.a $(KREQ1_OBJS) $(KREQ2_OBJS)
	@echo 'Loading hp-ux...'
	$(LD) -n -o hp-ux -e _start -x \
		$(KREQ1_OBJS) conf.o $(KREQ2_OBJS) $(LIBS)
	rm -f $(KREQ1_OBJS) $(KREQ2_OBJS)
	chmod 755 hp-ux


conf.o: conf.c
	rm -f conf.o
	@echo 'Compiling conf.c ...'
	$(CC) $(CFLAGS) $(COPTS) -c conf.c




I/O Performance 


- DMA 
- 300 and 400 each have two DMA channels 


- 700 has a DMA channel for most any interface that 
needs it 


- As of 9.0, the 700 will schedule I/O based on 3 things: 
- how long the request has been waiting 
- disk latency (seek, rotational delay, etc) 
- priority of the requesting process 


- Measurement 


- use iostat(1); if it just won't do the job, you
can monitor the structures it uses:


- tk_nin, tk_nout count characters going in and
out of the system via ttys


- dk_*[] arrays - for each of 8 devices,


  - dk_time[i] tells how much time this drive
    has been active


  - dk_seek[i] tells how many seeks this
    drive has done


  - dk_xfer[i] tells how many data transfers
    this drive has done


  - dk_wds[i] tells how many 64-byte "words"
    this drive has read/written


  - dk_mspw[i] tells how many milliseconds
    per "word" it has taken


  - there is a bit in dk_busy indicating
    whether this drive is doing something
    at the moment


RAMdisk Open 


An open routine typically performs some driver-specific operations. A
driver may support exclusive open (only one open at a time), and so
return an error for any additional opens. It may allocate buffer
space (if not already allocated). It may also perform a card reset
(e.g. the gpio card).


The RAM driver will allocate memory if it is the first open (that is,
there is presently no memory allocated for it). The open also ensures
the requested device is in the range (and size) of the driver. The
information on the device (drive number and size) is packed into the
minor number. The macros in ram.h are written to pull out the
pertinent information. The kernel provides similar macros for
extracting major, minor, selcode, volume, & unit numbers from the
"dev" value passed to the driver. The major and minor numbers are
packed into the 32-bit value, with 8 bits for the major number and
24 bits for the minor number.
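

The 8-bit/24-bit split can be sketched as follows. These demo_* helpers
are hypothetical stand-ins for the kernel's own macros, and they assume
the major number occupies the top 8 bits of the 32-bit value; the text
above gives only the field widths.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MINOR_BITS 24
#define DEMO_MINOR_MASK 0xffffffu

/* Pack an 8-bit major and a 24-bit minor into one 32-bit value. */
uint32_t demo_makedev(unsigned maj, uint32_t min)
{
    assert(maj <= 0xffu && min <= DEMO_MINOR_MASK);
    return ((uint32_t)maj << DEMO_MINOR_BITS) | min;
}

/* Top 8 bits: the major number. */
unsigned demo_major(uint32_t dev)
{
    return (unsigned)(dev >> DEMO_MINOR_BITS);
}

/* Low 24 bits: the minor number. */
uint32_t demo_minor(uint32_t dev)
{
    return dev & DEMO_MINOR_MASK;
}
```

Under these assumptions, a character device with major 1 and minor
0x0f0204 packs to 0x010f0204 and unpacks back to the same pair.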


/* max ram volumes cannot exceed 16 */
#define RAM_MAXVOLS 16


/* io mapping minor number macros */
/* up to 1048575 - 256 byte sectors */
#define RAM_SIZE(x) ((x) & 0xfffff) /* XXX */

/* up 16 disc allowed */
#define RAM_DISC(x) (((x) >> 20) & 0xf) /* XXX */

#define RAM_MINOR(x) ((x) & 0xffffff) /* XXX */
#define LOG2SECSIZE 8 /* log2 of the "sector" size (256 bytes) */


struct ram_descriptor {
    char *addr;       /* "disc space" in RAM */
    int size;         /* size of RAM disc */
    short opencount;  /* number of opens */
    short flag;
    int rd1k;         /* Stats for 1k reads */
    int rd2k;         /* Stats for 2k reads */
    int rd3k;         /* Stats for 3k reads */
    int rd4k;         /* Stats for 4k reads */
    int rd5k;         /* Stats for 5k reads */
    int rd6k;         /* Stats for 6k reads */
    int rd7k;         /* Stats for 7k reads */
    int rd8k;         /* Stats for 8k reads */
    int rdother;      /* Stats for other reads */
    int wt1k;         /* Stats for 1k writes */
    int wt2k;         /* Stats for 2k writes */
    int wt3k;         /* Stats for 3k writes */
    int wt4k;         /* Stats for 4k writes */
    int wt5k;         /* Stats for 5k writes */
    int wt6k;         /* Stats for 6k writes */
    int wt7k;         /* Stats for 7k writes */
    int wt8k;         /* Stats for 8k writes */
    int wtother;      /* Stats for other writes */
} ram_device[RAM_MAXVOLS];


/*
** Open the ram device.
*/
ram_open(dev, flag)
dev_t dev;
int flag;
{
    register unsigned long size;
    register struct ram_descriptor *ram_des_ptr;

    /* check if this is status open */
    if (RAM_MINOR(dev) == 0)
        return (0);

    /* check if this device is greater than max number of volumes */
    if ((size = RAM_DISC(dev)) > RAM_MAXVOLS)
        return (EINVAL);
    ram_des_ptr = &ram_device[size];

    /* check the size of the ram disc less than 16 sectors */
    if ((size = RAM_SIZE(dev)) < 16)
        return (EINVAL);

    /* check if already allocated */
    if (ram_des_ptr->addr != NULL) {
        /* then check if size changed; must be the same size */
        if (ram_des_ptr->size != size)
            return (EINVAL);
        /* bump open count */
        ram_des_ptr->opencount++;
    } else {
        /* allocate the memory for the ram disc */
        if ((ram_des_ptr->addr =
            (char *) sys_memall(size<<LOG2SECSIZE)) == NULL) {
            return (ENOMEM);
        }
        /* save size in 256 byte "sectors" */
        ram_des_ptr->size = size;
        /* open count should be zero */
        if (ram_des_ptr->opencount++) {
            panic("ram_open count wrong)\n");
        }
    }
    return (0);
}
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RAMdisk Read/Write routines 


This is a "typical" read & write routine for drivers that have a block
driver as well, or that will use a common read/write "strategy" routine
and buffer headers.


The physio() routine will take the information from
the uio and dev variables and construct a buf structure that contains
the information necessary for the strategy routine to perform the I/O.
Physio() will break up the transfers into small enough transfers for the
strategy routine to handle. The parameters to physio() are:


strategy   address of the strategy() routine physio will call

bp         pointer to a buf structure for physio to use; if
           NULL, physio will get one from the buffer cache

dev        the packed device info obtained when device opened

rw         either B_READ or B_WRITE, indicating transfer type

mincnt     address of mincnt() routine, a routine that
           determines the max transfer size (usually the
           kernel-provided minphys() (xfer size = 64k))

uio        uio structure containing info about the user and
           the I/O request (size & direction of transfer,
           pointers to user's buffers for the I/O, etc.)


In the RAM disk driver, the read & write routines have the physio()
routine request a buf structure from the file system's buffers. It uses
the kernel's minphys() routine, so the transfers will be broken up
to a maximum of 64k each.


ram_read(dev, uio)
dev_t dev;
struct uio *uio;
{
    return physio(ram_strategy, NULL, dev, B_READ, minphys, uio);
}


ram_write(dev, uio)
dev_t dev;
struct uio *uio;
{
    return physio(ram_strategy, NULL, dev, B_WRITE, minphys, uio);
}
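
The way a large request gets cut into pieces can be pictured with a small user-space sketch. This is not the kernel's physio(); `split_transfer`, `clamp_len`, `chunk_fn` and `MAX_CHUNK` are hypothetical names, and the 64k limit is simply taken from the minphys() description above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHUNK (64 * 1024)   /* assumed limit, per the minphys() note */

/* Hypothetical stand-in for a strategy routine: handles one piece. */
typedef void (*chunk_fn)(size_t offset, size_t len, void *ctx);

/* Clamp a requested length to the largest single transfer (the mincnt role). */
static size_t clamp_len(size_t len)
{
    return len > MAX_CHUNK ? MAX_CHUNK : len;
}

/* Break one request into pieces no larger than MAX_CHUNK; returns the count. */
static int split_transfer(size_t total, chunk_fn fn, void *ctx)
{
    size_t offset = 0;
    int pieces = 0;

    while (offset < total) {
        size_t len = clamp_len(total - offset);

        if (fn != NULL)
            fn(offset, len, ctx);   /* hand one piece to the "strategy" */
        offset += len;
        pieces++;
    }
    return pieces;
}
```

Passing a different clamp routine is what lets a driver with a smaller hardware limit reuse the same loop.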


RAMdisk Strategy 


This routine will actually perform the "I/O" to the RAM disc. The buf
structure passed to the strategy routine contains the necessary
information for the transfer. This info is filled in by kernel
routines; in the case of a character device, physio() does this, and
for block devices, the filesystem takes care of filling in the data.


ram_strategy(bp)
register struct buf *bp;
{
    register block_d7;
    register char *addr;
    register struct ram_descriptor *ram_des_ptr;

    /* if this is a status request, return ram_device structure */
    if (RAM_MINOR(bp->b_dev) == 0) {
        if ((bp->b_flags & B_PHYS) &&    /* must be char dev */
            (bp->b_flags & B_READ) &&
            (bp->b_bcount == sizeof(ram_device))) {
            bp->b_resid = bp->b_bcount;

            /* return the "ram_device" structure */
            bcopy(&ram_device[0], bp->b_un.b_addr,
                sizeof(ram_device));
        } else {
            bp->b_error = EIO;
            bp->b_flags |= B_ERROR;
        }
        goto done;
    }

    /* do the normal reads and writes to ram disc */
    ram_des_ptr = &ram_device[RAM_DISC(bp->b_dev)];

    /* sanity check if we got the memory */
    if ((addr = ram_des_ptr->addr) == NULL) {
        panic("no memory in ram_strategy\n");
    }

    /* make sure the request is within the size of the "disk" */
    if (bpcheck(bp, ram_des_ptr->size, LOG2SECSIZE, 0))
        return;

    /* calculate address to do the transfer */
    addr += bp->b_un2.b_sectno<<LOG2SECSIZE;

    /* for debugging file system only */
    block_d7 = bp->b_un2.b_sectno>>2;

    if (bp->b_flags & B_READ) {
        bcopy(addr, bp->b_un.b_addr, bp->b_bcount);
        switch (bp->b_bcount/1024) {
        case 1: ram_des_ptr->rd1k++; break;
        case 2: ram_des_ptr->rd2k++; break;
        case 3: ram_des_ptr->rd3k++; break;
        case 4: ram_des_ptr->rd4k++; break;
        case 5: ram_des_ptr->rd5k++; break;
        case 6: ram_des_ptr->rd6k++; break;
        case 7: ram_des_ptr->rd7k++; break;
        case 8: ram_des_ptr->rd8k++; break;
        default: ram_des_ptr->rdother++;
        }
    } else { /* WRITE */
        bcopy(bp->b_un.b_addr, addr, bp->b_bcount);
        switch (bp->b_bcount/1024) {
        case 1: ram_des_ptr->wt1k++; break;
        case 2: ram_des_ptr->wt2k++; break;
        case 3: ram_des_ptr->wt3k++; break;
        case 4: ram_des_ptr->wt4k++; break;
        case 5: ram_des_ptr->wt5k++; break;
        case 6: ram_des_ptr->wt6k++; break;
        case 7: ram_des_ptr->wt7k++; break;
        case 8: ram_des_ptr->wt8k++; break;
        default: ram_des_ptr->wtother++;
        }
    }
done:
    bp->b_resid -= bp->b_bcount;
    biodone(bp);
}
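
The address arithmetic and the per-size counters in the strategy routine can be tried out in user space. The sketch below assumes LOG2SECSIZE is 8, which follows from the "256 byte sectors" comment in the open routine; `sector_to_offset`, `request_fits` and `size_bucket` are hypothetical helpers, and `request_fits` is only a stand-in for the kind of test bpcheck() makes, not its real behavior.

```c
#include <assert.h>

#define LOG2SECSIZE 8            /* assumed: 256-byte "sectors" */

/* Byte offset of a sector within the RAM volume. */
static unsigned long sector_to_offset(unsigned long sectno)
{
    return sectno << LOG2SECSIZE;
}

/* Hypothetical range test: does a transfer of bcount bytes starting at
 * sector sectno stay inside a volume of size_in_sectors sectors? */
static int request_fits(unsigned long sectno, unsigned long bcount,
                        unsigned long size_in_sectors)
{
    unsigned long volume_bytes = size_in_sectors << LOG2SECSIZE;
    unsigned long start = sector_to_offset(sectno);

    return start <= volume_bytes && bcount <= volume_bytes - start;
}

/* Which counter a transfer lands in: 1..8 for 1k..8k, 0 for "other".
 * Integer division means a 2047-byte transfer still counts as 1k. */
static int size_bucket(unsigned long bcount)
{
    unsigned long k = bcount / 1024;

    return (k >= 1 && k <= 8) ? (int)k : 0;
}
```

An array indexed by `size_bucket()` would replace the two nine-way switch statements; the listing keeps separate fields, presumably so that ramstat can read them by name.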
RAMdisk Ioctl


The ioctl routine:

    executed via ioctl(2);

    purpose:
        handles commands passed to it via ioctl

    implement the various ioctls by including statements of the
    following form:
        #define CMD task(t, n, arg)

    where:
        CMD   command name
        t     arbitrary letter
        n     sequential number (unique for each ioctl define for a
              given ioctl routine)
        arg   optional arg for command

    "task" (a macro defined in sys/ioctl.h) is one of
        _IO    no arg
        _IOR   user reads info from the driver into arg
        _IOW   user writes info to driver from data in (or pointed
               to by) arg
        _IOWR  both _IOR and _IOW


There are two ioctl’s defined for the ramdisk driver.


/* ioctl to deallocate ram volume */
#define RAM_DEALLOCATE _IOW(R, 1, int)

/* ioctl to reset the access counter to ram volume */
#define RAM_RESETCOUNTS _IOW(R, 2, int)


ram_ioctl(dev, cmd, addr, flag)
dev_t dev;
int cmd;
caddr_t addr;
int flag;
{
    register struct ram_descriptor *ram_des_ptr;
    register volume;

    /* check if dev is the status dev */
    if (RAM_MINOR(dev) != 0)
        return (EIO);

    /* check if 0 - 15 disc volume */
    volume = *(int *) addr;
    if ((volume % RAM_MAXVOLS) != volume)
        return (EIO);

    /* calculate which ram volume it is */
    ram_des_ptr = &ram_device[volume];

    /* if not allocated, then return error */
    if (ram_des_ptr->addr == NULL) {
        return (ENOMEM);
    }

    switch(cmd) {
    /* mark for memory release on last close */
    case RAM_DEALLOCATE:
        ram_des_ptr->flag = RAM_RETURN;
        break;

    /* clear out access counts */
    case RAM_RESETCOUNTS:
        ram_des_ptr->rd1k = 0;
        ram_des_ptr->rd2k = 0;
        ram_des_ptr->rd3k = 0;
        ram_des_ptr->rd4k = 0;
        ram_des_ptr->rd5k = 0;
        ram_des_ptr->rd6k = 0;
        ram_des_ptr->rd7k = 0;
        ram_des_ptr->rd8k = 0;
        ram_des_ptr->rdother = 0;
        ram_des_ptr->wt1k = 0;
        ram_des_ptr->wt2k = 0;
        ram_des_ptr->wt3k = 0;
        ram_des_ptr->wt4k = 0;
        ram_des_ptr->wt5k = 0;
        ram_des_ptr->wt6k = 0;
        ram_des_ptr->wt7k = 0;
        ram_des_ptr->wt8k = 0;
        ram_des_ptr->wtother = 0;
        break;

    default:
        return (EIO);
    }
    return (0);
}
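
One detail of the volume check is worth a closer look. On compilers where `%` truncates toward zero (C99 and later guarantee this), a negative volume number passes the modulo test unchanged and would index in front of the ram_device array. The sketch below compares that test with an explicit range check; `RAM_MAXVOLS` is assumed to be 16 because the listing's comment mentions volumes 0 - 15, and both function names are hypothetical.

```c
#include <assert.h>

#define RAM_MAXVOLS 16   /* assumed from the "0 - 15 disc volume" comment */

/* The test used in the listing: true when the modulo leaves the
 * value unchanged. */
static int modulo_check_passes(int volume)
{
    return (volume % RAM_MAXVOLS) == volume;
}

/* Explicit range test; also rejects negative volume numbers. */
static int volume_in_range(int volume)
{
    return volume >= 0 && volume < RAM_MAXVOLS;
}
```

Whether the original compiler behaved the same way for negative operands is not something the listing tells us, so treat this as a caution rather than a confirmed bug.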
RAMdisk Close


The close routine may typically perform some driver specific operations.
It may flush buffers if the device supports asynchronous I/O (e.g. tty
driver). It will usually decrement an "open" counter and may release
I/O buffers, etc. on close.


The RAM disk driver just decrements an open count and releases memory on 
last close iff the RAM_RETURN flag has previously been set (by an ioctl). 


#define RAM_RETURN 1


struct ram_descriptor {
    char *addr;
    int size;
    short opencount;
    short flag;
    int rd1k;


} ram_device[RAM_MAXVOLS];


ram_close(dev)
dev_t dev;
{
    register struct ram_descriptor *ram_des_ptr;
    register i;

    /* check if this is status close */
    if (RAM_MINOR(dev) != 0) {
        ram_des_ptr = &ram_device[RAM_DISC(dev)];
        if (--ram_des_ptr->opencount < 0)
            panic("ram_close count less than zero\n");
    }
    /* free all ram volumes with flag set and open count = 0 */
    /* RAM_RETURN flag is set by an ioctl call */
    ram_des_ptr = &ram_device[0];
    for (i = 0; i < RAM_MAXVOLS; i++, ram_des_ptr++) {
        if ((ram_des_ptr->flag & RAM_RETURN) == 0)
            continue;
        if (ram_des_ptr->opencount != 0)
            continue;
        /* release the system memory */
        sys_memfree(ram_des_ptr->addr, ram_des_ptr->size<<LOG2SECSIZE);
        /* zero the whole entry */
        bzero((char *)ram_des_ptr, sizeof(struct ram_descriptor));
    }
}
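
The "release on last close, but only if asked" policy is easy to exercise outside the kernel. The following is a user-space sketch only: `struct ram_volume` is a trimmed-down stand-in for ram_descriptor, malloc()/free() stand in for sys_memall()/sys_memfree(), sizes are in bytes for simplicity, and the `vol_*` names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RAM_RETURN 1

struct ram_volume {              /* stand-in for ram_descriptor */
    char *addr;
    int   size;                  /* bytes here, not sectors */
    short opencount;
    short flag;
};

/* First open allocates; later opens must ask for the same size. */
static int vol_open(struct ram_volume *v, int size)
{
    if (v->addr == NULL) {
        v->addr = malloc((size_t)size);
        if (v->addr == NULL)
            return -1;
        v->size = size;
    } else if (v->size != size) {
        return -1;
    }
    v->opencount++;
    return 0;
}

/* What the deallocate ioctl does: just leave a note for close. */
static void vol_mark_for_release(struct ram_volume *v)
{
    v->flag |= RAM_RETURN;
}

/* Memory goes away only on the last close of a marked volume. */
static void vol_close(struct ram_volume *v)
{
    assert(v->opencount > 0);
    if (--v->opencount == 0 && (v->flag & RAM_RETURN)) {
        free(v->addr);
        memset(v, 0, sizeof *v);
    }
}
```

A volume that is closed without having been marked keeps its memory, which is why the files on a RAM disc survive an unmount until someone runs the deallocate ioctl.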


STEPS TO ADD THE RAMDISK DRIVER TO YOUR KERNEL

STEP 1)  # cd /etc/conf

STEP 2)  make sure there is a line in /etc/master that looks like this:
             ramdisc ram - 3 FB 4 20
         Note: Major numbers may differ; reflect this in the mknod commands below.

STEP 3)  add "ramdisc" to your dfile

STEP 4)  compile your source file and either put it in the library that
         currently has the ramdisk driver in it or else put it in the
         makefile after step 6
             # cc -c ramdisk.c
             # ar -rv libXXX.a ramdisk.o

STEP 5)  # config dfile

STEP 6)  # make -f config.mk
         if you chose not to ar(1) the .o file into the library, edit
         config.mk (might want to rename it to "makefile" first) to
         include "ramdisk.o" just before the "LIBS" in the "ld" line:
             ld -abcdefg x.o y.o z.o ramdisk.o $(LIBS)

STEP 7)  # mv hp-ux /

STEP 8)  # reboot

STEP 9)  # /etc/mknod /dev/ram b 4 0xVSSSSS       where V = volume number (0..0xf)
         # /etc/mknod /dev/rram c 20 0xVSSSSS     SSSSS = # of 256 byte sectors
         # /etc/mknod /dev/ram128K b 4 0x000200   (block 128Kb ram volume)
         # /etc/mknod /dev/rram128K c 20 0x000200 (char 128Kb ram volume)
         # /etc/mknod /dev/ram1M b 4 0x101000     (block 1Mb ram volume)
         # /etc/mknod /dev/rram1M c 20 0x101000   (char 1Mb ram volume)
         # /etc/mknod /dev/ram4M b 4 0x404000     (block 4Mb ram volume)
         # /etc/mknod /dev/rram4M c 20 0x404000   (char 4Mb ram volume)

STEP 10) # mkfs /dev/ram128K 128 8 8 8192 1024 32 0 60 8192 (mkfs for 128Kb volume)
         # mkfs /dev/ram1M 1024 (make file system for 1Mb volume)
         # mkfs /dev/ram4M 4096 (make file system for 4Mb volume)

STEP 11) # mkdir /ram128K
         # mount /dev/ram128K /ram128K (mount 128K ram volume)

To make the control /dev for "ramstat":
         # /etc/mknod /dev/ram c 20 0x0 (status is raw dev only)


To release the memory of disc #1 (destroying all files on the volume at umount):
         # ramstat -d 1 /dev/ram


To get a status of all memory volumes:
         # ramstat /dev/ram


To reset the access counters of memory volume #1:
         # ramstat -r 1 /dev/ram
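
The minor numbers in the mknod commands follow the 0xVSSSSS pattern: one hex digit of volume number followed by five hex digits of size in 256 byte sectors. The helpers below are a sketch of that packing for checking your own mknod arguments; the function names are hypothetical and are not the driver's RAM_DISC()/RAM_SIZE() macros, whose definitions are not shown here.

```c
#include <assert.h>

/* Pack a volume number (0..0xf) and a sector count (20 bits) as 0xVSSSSS. */
static unsigned long ram_minor_pack(unsigned volume, unsigned long sectors)
{
    assert(volume <= 0xf && sectors <= 0xfffffUL);
    return ((unsigned long)volume << 20) | sectors;
}

/* Recover the volume number from a packed minor number. */
static unsigned ram_minor_volume(unsigned long minor)
{
    return (unsigned)((minor >> 20) & 0xf);
}

/* Recover the size in 256 byte sectors from a packed minor number. */
static unsigned long ram_minor_sectors(unsigned long minor)
{
    return minor & 0xfffffUL;
}

/* Convert a size in kbytes to 256 byte sectors. */
static unsigned long kbytes_to_sectors(unsigned long kbytes)
{
    return kbytes * 1024UL / 256UL;
}
```

For instance, a 1Mb volume is 4096 (0x1000) sectors, so volume 1 at that size packs to 0x101000, matching the /dev/ram1M line in step 9.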


SE 390: Series 300 HP-UX Internals 


System Panics 


Overview 


- Panics happen when the system thinks that "1 == 0" and 
realizes that thinking this is not a good sign :-) 


- The (mounted) disks get sync(2)ed, but are *not* marked 
clean, which will probably force an fsck(1m) when the 
system boots. 


- If running 7.0 or later, we will consider dumping physical 
RAM to the swap area (known as "Savecore"). This won’t 
happen unless there is local swap of some sort, and it 
can be disabled in 8.0 and later releases by adb(1)ing the 
kernel variable do_savecore to 0. 


- If the kernel debugger is active, control will be passed to 
it; otherwise we halt in a tight loop, and the power must 
be cycled for the system to reboot. 


- If you are seeing significant numbers of panics, the most
likely possibility is a hardware problem.


- The S700 has "analyze" available, and it is very helpful in
extracting useful information from a core dump.
System Shutdown


(Hopefully un)Common Kinds of Panics 


Parity error - is a fact of life with parity-checking memory. 


Dup ialloc or freeing free {inode,frag} - usually caused by 
mounting a corrupt disk. Pay attention when the system tells 
you to fsck! 


Bus error - often indicates a hardware problem. If it happens to 
a user, he is sent a signal. It should never happen in the kernel 
and if it does the system will panic. It could also come from a
kernel bug, but most of the ones we’ve seen have been due to 
hardware problems. 


I/O Error in Push - generally points to bad interface card, cable, 
or disk. "Push"ing a page out refers to writing a page to the 
swap area, and the system will panic if the write() fails.


In 8.0 this one will say something like "Syncpageio detected 
an error". 


If you know of other "legitimate" panics, let me know so I 
can include them on this list in the future. 
System Shutdown


Interpreting S300/400 Panic Dumps


First column consists of stack addresses. 

Numbers in the other columns that are either in the first one 
or are sandwiched by numbers in the first one are probably 
frame pointers. 


Find first appropriate address (frame pointer). It is the address 
of the next one, which is the address of the next one.... 


Trace linked list of frame pointers. 


Numbers just to the right of the frame pointers are return 
addresses. 


Feed return addresses to adb(1) to see who called whom.


Reading Series 300 Panic Dumps 


When in the course of human events an HP-UX system can’t figure out what’s 
going on, it throws up its hands and decides to reboot and try again. When 
this happens, it is known as a "panic", and the system tries to be helpful 
by printing out the contents of the kernel stack as it dies. Here is part 
of one:


97bdaa: 00051c90 000ffe01 ffe79405 ffe79401 00000000 00979018 000ec7fa 000ec7fa
97bdca: 0006889a 00000000 0000e000 0006f66c 0097be26 00015314 000ec7fa 00000184
97bdea: 00000000 0000e000 00000000 00000000 03000000 00000000 00000000 00000000 


The first column consists of stack addresses. The stack grows down in memory,
so the top line is the stuff that has been put on the stack most recently. The
trace goes from left to right, so the lowest address (most recently pushed) is
at the top left; the highest is at the bottom right.


The last eight columns are the actual contents of the stack. There are several 
kinds of things on it: 

- arguments to functions 

- return addresses 

- frame pointers 

- local variables for functions 

- saved copies of registers that will be trashed in the called function 

- exception information (stuff put there in case of divide by 0, etc) 

- junk 
It would be nice if the last item didn’t have to be there, but it does. This 
is because not all code uses the conventions established by the HP-UX C 
compiler. This will be dealt with a bit later. 


The frame pointer (third item in the list above) is a very important one - it is the key to
our ability to trace back through the dump. When a procedure is called, it 
pushes the frame pointer (register a6 on the 680x0) onto the stack and then 
copies the stack pointer into the frame pointer. It then subtracts from the 
stack pointer (remember that the stack grows down) to make room for local 
variables. The fact that the old frame pointer is pushed each time a 
procedure is called is what enables us to "walk" or "unwind" the stack. 


Since the frame pointers are stack addresses, the basic idea is to look 
through columns 2-9 for a number that either appears in column 1 or is 
sandwiched by two numbers in column 1. An important thing to remember is that 
the addresses may be misaligned by two bytes. An example may help here: 


98c9da: 00234567 0098c9fa 00034562 
98c9fa: ..... 


The "0098c9fa" was properly aligned, but if the line had read 
98c9da: 00234567 89ab0098 c9fa0003 


that would have been OK too. Once the first address has been found, others
can be found by treating each one as a pointer; i.e., the frame pointers form
a linked list.
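
The hunt for the first frame pointer, including the two-byte misalignment, can be written down as a short sketch. It assumes the dump row has already been turned back into raw bytes and that the caller supplies the range of addresses seen in column 1; `be32` and `find_frame_pointer` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Read a big-endian 32-bit value (the 680x0 byte order) from a buffer. */
static unsigned long be32(const unsigned char *p)
{
    return ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16) |
           ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
}

/*
 * Scan dumped stack bytes at every 2-byte offset and return the offset
 * of the first 32-bit value that lies inside [lo, hi), the range of
 * stack addresses in the dump. Returns -1 if there is no candidate.
 */
static long find_frame_pointer(const unsigned char *bytes, size_t len,
                               unsigned long lo, unsigned long hi)
{
    size_t off;

    for (off = 0; off + 4 <= len; off += 2) {
        unsigned long v = be32(bytes + off);

        if (v >= lo && v < hi)
            return (long)off;
    }
    return -1;
}
```

Run against the two sample rows above, the aligned row yields its candidate at byte offset 4 and the misaligned row at byte offset 6. A hit is only a candidate; ordinary data can fall in the stack range by coincidence.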


Surrounding each frame pointer is some interesting information. It is often 
referred to as an "activation record". The first part of the record will be 
arguments for the called procedure (keep in mind that these are treated as 
local variables by the called procedure and thus may have been modified by 
it). Next, a return address for the calling procedure. Third, the saved 
frame pointer. Next, space for local variables in the called procedure. 
Last, space for registers that the called routine wants to use. 
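
Walking the chain of activation records amounts to following a linked list whose "next" field is the saved frame pointer, with the return address in the word just after it. The sketch below models a captured stack as an array of 32-bit words; `struct stack_image`, `word_at` and `walk_frames` are hypothetical names, and for brevity it assumes every frame is longword aligned relative to the start of the image, which a real dump does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* A captured stack: 32-bit words starting at address base. */
struct stack_image {
    unsigned long base;              /* address of words[0] */
    const unsigned long *words;
    size_t nwords;
};

/* Fetch the word at a stack address; returns 0 if outside the image. */
static int word_at(const struct stack_image *s, unsigned long addr,
                   unsigned long *out)
{
    if (addr < s->base || (addr - s->base) % 4 != 0)
        return 0;
    if ((addr - s->base) / 4 >= s->nwords)
        return 0;
    *out = s->words[(addr - s->base) / 4];
    return 1;
}

/*
 * Follow saved frame pointers starting at fp. The word at fp is the
 * caller's frame pointer and the word at fp+4 is the return address.
 * Stores up to max return addresses; returns how many were found.
 */
static size_t walk_frames(const struct stack_image *s, unsigned long fp,
                          unsigned long *ret_addrs, size_t max)
{
    size_t n = 0;
    unsigned long saved_fp, ret;

    while (n < max && word_at(s, fp, &saved_fp) && word_at(s, fp + 4, &ret)) {
        ret_addrs[n++] = ret;
        if (saved_fp <= fp)      /* chain must move to higher addresses */
            break;
        fp = saved_fp;
    }
    return n;
}
```

The walk stops when a saved frame pointer leads outside the image, which corresponds to the point in a real dump where the next "pointer" is nowhere near the addresses in the left column.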


Consider the following example. The lines of the dump have been split apart
and directional lines have been drawn to show the linked list structure.


panic: init died

[Figure: the stack dump for this panic, rows 97be4a: through 97bfea:, with
arrows drawn from each saved frame pointer to the row it points at. Return
addresses visible next to the frame pointers include 000107ca, 00010062,
00016cc8, 0000ac8c, 00004ae4 and 00004904. The last link in the chain is
annotated: "The buck stops here - this address isn’t close to what’s in the
left column."]

It is important to remember that much of this is dependent on routines using
the normal calling convention. There will be exceptions to this. If someone
writes a routine in assembly language and doesn’t bother to save the frame
pointer, this will mess things up a bit. The frame pointers will be good, but
one of the activation records will have a return address that doesn’t make too
much sense, because there is not a matching frame pointer. The same thing
will happen if an exception (such as a bus error) is encountered in kernel
mode. Note that either of these things can cause small glitches in the trace,
but they don’t necessarily mean the end of the hunt.


A third oddity is introduced when a routine is called indirectly. Probably 
the most common example of this is a kernel routine named syscall(); it calls 
the actual code for a given system call by jumping indirectly. Indirect calls 
don’t automatically end the trace, but the one in syscall() often does. The 
reason is that the stack that is dumped out is the *kernel* stack - we can’t 
walk back into user land on the kernel stack. One thing that an indirect call 
will always do is make things a bit less clear later on when we are trying to 
figure out who called whom. 


Once the stack has been unwound, how do we find out what the numbers mean? The 
easiest way is probably to use the assembly level debugger, adb(1). If adb(1) 
is run on the kernel that panicked (or one that is the same version and has 
been configured IDENTICALLY), it will translate absolute addresses into 
symbolic ones. By giving each address to adb(1) and doing a bit of 
interpretation, a symbolic traceback can be constructed. It will usually have 
things like boot() and panic() at the top and things like read() or setuid() 

at the bottom. The important stuff will be in the middle. 


To start, use a command something like this: 
$ adb /hp-ux 


Once adb(1) has started up, you can get it to do things like tie absolute 
addresses to known symbols or disassemble parts of the code. The fundamental 
command we will use will be of this form: 


<address>?<n>i as in 32cea?20i 


The address is typically an absolute hexadecimal number, the question mark 
says to print out what that address is, <n> is the number of times to do it, 
and "i" tells it to interpret the stuff as instructions. It can safely be 
said that adb(1) is not one of the friendlier HP-UX utilities. For 
instance: there is no prompt, and the commands (as seen above) are a bit 
cryptic. Note that to exit you have two choices: "$q" or the old standby, 
CTRL-d. And now back to our story.... 


Since we know that the return address is just to the right in the printout 
(was pushed just before the frame pointer), we can take this number and feed 
it to adb(1) to find out what routine made the call. In the 2nd example, the 
return address was 00034562. To find out what routine that is in, we might 
use this: 


34562?i 
To see a bit of context, we would do something like this: 
34550?20i 


There is a catch with this. Instructions are aligned on even byte (word)
boundaries, not on 4 byte (longword) boundaries, so an address picked for
context may fall in the middle of an instruction. If you tell adb(1) to start
disassembling at such an address, you will get a bogus list of instructions.
One way of detecting this is to look and see if there is some kind of call
instruction in the disassembly listing - if there isn’t, chances are
*excellent* that the disassembly is misaligned.


For an example, we’ll look at the addresses in the stack tracing example
above. Just to the right of each frame pointer is the return address for that 
call. By feeding these to adb(1), we can figure out who called whom. What 
follows is a logfile of a session with adb(1), with three things done to it: 
1) blank lines have been inserted for clarity; 2) most of the tries that 
yielded misaligned results have been eliminated; 3) comments have been added; 
they start with "#". 


$ adb /hp-ux
executable file = /hp-ux
core file = core
ready

107ca?i
_biowait+0x22:   addq.w  &0x8,%a7

107af?10i                                       # not looking good
_biowait+0x7:    bgt.w   _bmap+0x523
                 eor.b   %d4,%d0
                 ori.b   &0xFFFFEC2D,%a1
                 mov     %sr,???
                 fsun    -(%a0)                 # should be a call to sleep
                 movq    &0x0,%d4               # in here somewhere
                 sub.w   %a0,%d2
                 subq.w  &0x2,%a6
                 eor.b   %d4,%d0
                 ori.w   &0x1C50,???            # try again!

107b0?10i
_biowait+0x8:    ori.b   &0x4EB9,%a0
                 ori.b   &0x9EC,%d0
                 mov.l   %d0,-0x4(%a6)
                 bra.b   _biowait+0x24
                 pea     0x94.w
                 pea     (%a5)
                 jsr     _sleep                 # now we're talking...
                 addq.w  &0x8,%a7               # pop 8 bytes of args off stack
                 mov.l   (%a5),%d0
                 movq    &0x2,%d1

10062?i
_bwrite+0x92:    mov.l   %a5,(%a7)
10050?10i
_bwrite+0x80:    jsr     (%a0)
                 addq.w  &0x4,%a7
                 btst    &0x8,%a7
                 bne.b   _bwrite+0x9E
                 pea     (%a5)
                 jsr     _biowait
                 mov.l   %a5,(%a7)
                 jsr     _brelse
                 addq.w  &0x4,%a7
                 bra.b   _bwrite+0xAE

1450a?i
_sbupdate+0x4C:  mov.l   0x34(%a5),(%a7)
144f0?10i
_sbupdate+0x32:  mov.l   %d0,-(%a7)
                 mov.l   0x22(%a4),-(%a7)
                 pea     (%a5)
                 jsr     _bcopy
                 lea     0xC(%a7),%a7
                 pea     (%a4)
                 jsr     _bwrite
                 mov.l   0x34(%a5),(%a7)
                 mov.l   0x34(%a5),%d0
                 subq.l  &0x1,%d0

16cc8?i
_update+0xD4:    addq.w  &0x4,%a7
16cb0?10i
_update+0xBC:    clr.b   0xD0(%a0)
                 mov.l   -0x4(%a6),%a0
                 mov.l   _time,0x20(%a0)
                 pea     (%a4)
                 jsr     _sbupdate
                 addq.w  &0x4,%a7
                 ...     0x18(%a4),%a4
                 ...     %a4,&0x9CFE8
                 ...     _update+0x42
                 ...     _inode,%a5

99f4?i
_boot+0x8A:      addq.w  &0x4,%a7
99e6?10i
_boot+0x7C:      ...     _boot+0x90
                 pea     0x0.w
                 jsr     _update                # this is the one
                 addq.w  &0x4,%a7
                 ...     _boot+0x9C
                 pea     0x1.w
                 jsr     _update
                 addq.w  &0x4,%a7
                 ...     _reboot_after_panic+0x1E0
                 ...     _printf

ac8c?i
_panic+0xC4:     addq.w  &0x8,%a7
ac7c?6i
_panic+0xB4:     ...     (68881)
                 ...     0x8(%a6)
                 ...     -0x4(%a6),-(%a7)
                 jsr     _boot
                 addq.w  &0x8,%a7
                 ...     _panic+0xC6

4ae4?i
_exit+0x1D8:     addq.w  &0x4,%a7
4ad0?10i
_exit+0x1C4:     ...     %d4,%d6
                 ...     %d0,0x2A(%a5)
                 ...     _exit+0x1DA
                 ...     _nsysent+0x88
                 jsr     _panic
                 addq.w  &0x4,%a7
                 ...     0xA(%a6),0x52(%a5)
                 ...     _u+0x84E,0x9C(%a5)
                 ...     _u+0x84A,0x98(%a5)
                 ...     _u+0x846,0x94(%a5)

4904?i
_rexit+0x20:     addq.w  &0x4,%a7
48f4?10i
_rexit+0x10:     ...     &0xFF,%d0
                 ...     &0x8,%d0
                 ...     %d0,-(%a7)
                 jsr     _exit
                 addq.w  &0x4,%a7
                 ...     (%a7),%a5
                 ...     %a6
                 ...     %a6,&0xFFFFFFF0
                 ...     &<%a7,%a4,%a5>,(%a7)

ebdc?i
_syscall+0x15E:  lea     _u+0x78,%a0
ebc8?10i
_syscall+0x14A:  ...     %d2,%d0
                 ...     &0x1,(%a0)
                 lea     _u+0x9FA,%a0
                 clr.w   (%a0)
                 mov.l   0x4(%a3),%a0
                 jsr     (%a0)                  # note indirect call
                 lea     _u+0x78,%a0
                 tst.b   (%a0)
                 beq.b   _syscall+0x186
                 lea     _u+0x9FA,%a0

$q
By looking at this bottom-up, we can see that the order of calls was like this: 
syscall () 
rexit () 
exit () 
panic () 
boot () 
update () 
sbupdate () 
bwrite () 
biowait () 
sleep () 


Note that we didn’t see a "jsr _rexit" in syscall(); we just looked at where 
we had been before. 


What can we learn from all of this? That depends. It is conceivable that 
this kind of information could help track down a kernel bug. It is also 
possible that it could satisfy a customer’s curiosity. One nice thing to know 
is that as of 6.0, the kernel will construct a symbolic traceback complete with
the arguments to the calls - this will be printed on the screen just below the 
stack dump. 
The Big Picture


How does HP-UX organize disks and access files? 


The Little Pictures 


History. 
The vnode layer & pathname lookup. 
Caching: buffers, inodes, cdnodes, and directory names. 
Mounting and unmounting file systems. 
General flow within the kernel. 
The HFS /Berkeley /McKusick file system. 
- History and layout. 


- On-disk data structures. 


The Problem


A customer calls and says that he can’t boot. You go to help him
out, and take a loaner disk. You boot off of the loaner and try
to fsck(1m) his disk. It fails, and after a bit of poking around
you deduce that someone has tar(1)ed over the first part of his
disk. What will you do?
File System


The original UN*X file system 


Superblock (single copy on disc) 

I-nodes (grouped together) 

Data blocks (small size = 512 bytes) 

Advantages: 

* handles large numbers of small files efficiently
* easy to implement 

Disadvantages: 

* limited file I/O throughput 

* lack of locality on disk 

* lack of robustness 


* designed for "small" systems/disks 
File System


Picture of a Bell file system 


+-------+-------+---------+--------+
| Boot  | Super | I-nodes | Data   |
| Block | Block |         | Blocks |
| (BB)  | (SB)  | (I-n)   | (DB)   |
+-------+-------+---------+--------+
How The Kernel & File System Fit Together


                    +-------------+
                    |   Kernel    |
                    +-------------+
                           |
                    +-------------+
                    | Vnode Layer |
                    +-------------+
                           |
       +-------------+-----+-------+-------------+
       |             |             |             |
 +----------+  +----------+  +----------+  +----------+
 | Diskless |  |   NFS    |  |   UFS    |  |   CDFS   |
 +----------+  +----------+  +----------+  +----------+
       |             |             |             |
       +------+------+             +------+------+
              V                           V
     +----------------+         +-------------------+
     | LAN -> server  |         | Buffer Cache ->   |
     +----------------+         | dev. drivers ->   |
                                | disk              |
                                +-------------------+


The Vnode Layer 


- Why? 


- How?


It does for the filesystem what the device driver
interface (open, close, read, write, strategy, etc)
did for device drivers.


To allow the system to access files that are on a remote 
machine, or that are on a disk that isn’t HFS. 


To be compatible with the industry. 


Most file system activity revolves around "vnodes", which 
are like inodes but are not implementation dependent. 


vnodes only exist in-core, and are part of in-core inodes 
or cdnodes or... 


  in-core inode
+---------------+     At boot time, the vnode
| +-----------+ |     in each in-core inode
| |   vnode   | |     will be initialized to
| +-----------+ |     point at HFS routines;
|               |     if CD-ROM is configured
|               |     into the system, the vnode
|               |     in each cdnode would be
|               |     set up to point at CDFS
+---------------+     functions.


The vnode layer is object-oriented in the sense that a
vnode carries around a list of operations that can be
done on it. If the system wants to read from a file
represented by (struct vnode *)vp, it will do something
like this (this is not actual code):

    (*(vp)->v_op->vn_read)(vp, rwflag, buf, size)

This will call a routine to read from the file, whether
the file is local, remote, on a PC, or whatever. In
concept, it is roughly this:

    switch (vp->v_type)
        case VHFS:  hfs_read(vp, rwflag, buf, size)
        case VNFS:  nfs_read(vp, rwflag, buf, size)
        case VCDFS: cdfs_read(vp, rwflag, buf, size)
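
The operations-vector idea can be shown in a few lines of ordinary C. This is an illustration only, not kernel code: the structures are cut down to a single operation, the rwflag argument is dropped, and `fill`, `hfs_read`, `nfs_read` and the wrapper `vn_read` are hypothetical stand-ins that merely stamp the buffer so the two back ends can be told apart.

```c
#include <assert.h>
#include <string.h>

struct vnode;

/* Operations vector: one entry per operation a vnode supports. */
struct vnodeops {
    int (*vn_read)(struct vnode *vp, char *buf, int size);
};

struct vnode {
    const struct vnodeops *v_op;     /* set by the owning file system */
};

/* Fill the buffer with one character; stands in for real I/O. */
static int fill(char *buf, int size, char c)
{
    assert(size >= 0);
    memset(buf, c, (size_t)size);
    return size;
}

/* Two hypothetical back ends sharing one signature. */
static int hfs_read(struct vnode *vp, char *buf, int size)
{
    (void)vp;
    return fill(buf, size, 'H');
}

static int nfs_read(struct vnode *vp, char *buf, int size)
{
    (void)vp;
    return fill(buf, size, 'N');
}

static const struct vnodeops hfs_vnodeops = { hfs_read };
static const struct vnodeops nfs_vnodeops = { nfs_read };

/* The caller never asks which file system it is talking to. */
static int vn_read(struct vnode *vp, char *buf, int size)
{
    return (*vp->v_op->vn_read)(vp, buf, size);
}
```

Compared with the switch statement, the table approach means a new file system type only has to supply its own vector; the callers do not change.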
Pathname Lookup


- Many system calls take a character string that is a pathname. 
Before they can do much, they must figure out where the file 
is and what type it is. This requires lots of work.... 


- The basic plan of attack for lookupname() is to look
for the vnode that corresponds to the pathname we’re
interested in. Here’s a greatly simplified view:


    while there’s another element in the path
        if that element is in the dnlc
            use the vp there
        else
            call _lookup for the type of
            fs the current component is in


There are some "gotchas" left out here (RFA, Diskless, mount pts), 
but this is the guts of the algorithm. The "else" clause above is 
important - it’s what allows us to cleanly resolve pathnames even 
though each element of the path may belong to a different fs type. 
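
The loop above can be modeled in user space with a tiny in-memory tree and a name cache in front of it. Everything here is hypothetical scaffolding: nodes are array indexes rather than vnodes, `fs_lookup` is a linear search standing in for a file system's own lookup routine, and the cache never evicts. The counter `slow_lookups` is there only to show how often the cache is missed.

```c
#include <assert.h>
#include <string.h>

#define NCACHE 8

struct node { int parent; const char *name; };    /* stand-in for a vnode */

/* Hypothetical directory tree: index 0 is the root. */
static const struct node tree[] = {
    { -1, "/" }, { 0, "usr" }, { 1, "lib" }, { 1, "local" }, { 3, "bin" },
};
#define NNODES ((int)(sizeof tree / sizeof tree[0]))

struct centry { int used; int parent; char name[16]; int node; };
static struct centry cache[NCACHE];
static int slow_lookups;           /* searches that missed the cache */

/* Stand-in for the per-file-system lookup. */
static int fs_lookup(int parent, const char *name)
{
    int i;

    slow_lookups++;
    for (i = 0; i < NNODES; i++)
        if (tree[i].parent == parent && strcmp(tree[i].name, name) == 0)
            return i;
    return -1;
}

/* Try the name cache first; fall back to fs_lookup and remember the hit. */
static int cached_lookup(int parent, const char *name)
{
    int i, node;

    for (i = 0; i < NCACHE; i++)
        if (cache[i].used && cache[i].parent == parent &&
            strcmp(cache[i].name, name) == 0)
            return cache[i].node;
    node = fs_lookup(parent, name);
    if (node >= 0 && strlen(name) < sizeof cache[0].name)
        for (i = 0; i < NCACHE; i++)
            if (!cache[i].used) {
                cache[i].used = 1;
                cache[i].parent = parent;
                strcpy(cache[i].name, name);
                cache[i].node = node;
                break;
            }
    return node;
}

/* Resolve an absolute path one component at a time; -1 if not found. */
static int lookup_path(const char *path)
{
    char comp[16];
    int cur = 0;

    while (*path != '\0') {
        size_t n;

        while (*path == '/')
            path++;
        n = strcspn(path, "/");
        if (n == 0)
            break;
        if (n >= sizeof comp)
            return -1;
        memcpy(comp, path, n);
        comp[n] = '\0';
        cur = cached_lookup(cur, comp);
        if (cur < 0)
            return -1;
        path += n;
    }
    return cur;
}
```

Resolving /usr/lib and then /usr/local/bin costs four slow lookups in total rather than five, because "usr" is already cached the second time around.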


[Figure: the DNLC drawn beside the in-core inode table. Each DNLC entry
(for example "usr", "lib", "local", "bin") holds a name together with a vp
field and a pvp field, and the vp fields point at entries in the in-core
inode table.]
Caching


- The buffer cache - used to avoid reading things that were read 
"recently" and to keep from having to write stuff out if it’s just 
going to get trashed shortly. Buffers are also available for use 
as scratch space if drivers need to use them.


- Prior to 9.0, the buffer cache was sized by nbuf/bufpages; 
if these were nonzero, the system used them; otherwise, 
68K machines would use 10% of the 1st 5MB and 5% of 
the rest of RAM; PA boxes would use 10% of RAM 


- In 9.0, we have a "dynamic buffer cache" (DBC); it is 
still possible to set a specific size using the tunables 
above, but in general it is best to let the system 
grow/shrink the cache as needed - as the filesystem 
uses pages, the cache size will grow; if the system 
runs short of memory (user processes ask for some), the 
pager will take pages back from the DBC. 


If the DBC is taking too much, either set nbuf/bufpages
explicitly or else adb(1) dbc_ceiling to set a limit 
on the size of the cache. 


dbc_ceiling  --->  +--- physmem ---+
                   |               |
                   |               |  <--- bufpages
                   |               |
dbc_bufpages --->  +---------------+


- dbc_bufpages is the "floor" - the minimum number of pages
the cache will have (default 64) 


- dbc_ceiling is the maximum - (default physmem) 


- bufpages is the current number of pages taken by the 
cache (if you set bufpages explicitly, it will do at 
boot time what it used to - hold the cache at that size)


- The inode cache - used to keep track of inodes so that we don’t 
always have to get them off of the disk. Pathname translation 
boils down to accessing lots of inodes, so the less often we have 
to get them from disk the better. If a file on a Berkeley FS 
disk is open, there *must* be a copy of its inode in-core. 


- Directory name lookup cache - speeds up pathname translation.
It consists of a set of filenames and their respective vnode
pointers. The system is frequently asked to open files in
/usr/lib; thus it makes sense to have "usr" and "lib" sitting in
the cache. This will often save several disk accesses for a
single pathname translation. The name is somewhat misleading;
there are ordinary filenames in the cache too.
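
The pre-9.0 default sizing rule for the buffer cache is plain arithmetic and can be checked with a few lines of C. The sketch takes the percentages straight from the text above; the function names are hypothetical, and the kernel's actual rounding (it works in pages, not bytes) is not reproduced here.

```c
#include <assert.h>

#define MB (1024UL * 1024UL)

/* 68K default: 10% of the first 5MB of RAM plus 5% of the rest. */
static unsigned long default_cache_68k(unsigned long ram_bytes)
{
    unsigned long first = ram_bytes < 5 * MB ? ram_bytes : 5 * MB;
    unsigned long rest = ram_bytes - first;

    return first / 10 + rest / 20;
}

/* PA default: a flat 10% of RAM. */
static unsigned long default_cache_pa(unsigned long ram_bytes)
{
    return ram_bytes / 10;
}
```

On a 16MB machine this works out to roughly 1.05MB of cache for a 68K box against roughly 1.6MB for a PA box.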
Mounting And Umounting File Systems


- Only block devices need apply :-) 


- Mounting a disk with vfsmount(2) makes that disk’s file system
a part of the present file system; its root "covers" the directory 
we mount it on. 


- Pathname lookup is affected. When lookupname() is resolving a 
pathname, it checks the vnode for each element to see if it has
been "covered". If so, it jumps to the "covering" vnode and 
continues the search. The "is this thing covered?" question is 
asked before "where’s the directory this vnode corresponds to?" 


There is also a possibility that the current vnode is covering 
another one and we are moving *up* in the directory hierarchy 
(what if we are resolving "../.."?); in this case, we must jump 
to the vnode we are covering and continue on. 


- When a disk is mounted, it is added to a list of mounted file 
systems. This is used for a number of things, not the least 
of which is when reboot(2) is shutting down the system. In that 
case, it’s important that we not have to rely on /etc/mnttab! 


- When a disk is umount(2)ed, the system checks to make sure no
files on that disk are open; if they are, the umount(2) will fail
with EBUSY. No such checking is done with vfsmount(2) (try it :-)
Important Data Structures


Per process: 


u_ofile - semi-static array in each process’s u area. A 
"file descriptor" is just an index into this array, so 
whenever a process open(2)s a file, a slot in this array 
is taken up. In >=8.0, this array will be dynamic 

and will be sized by calls to setrlimit(2), with an 
upper bound of "maxfiles_lim" (1024).


u_rdir - vnode pointer for this process’ root directory. 
See sys/user.h 


u_cdir - vnode pointer for this process’ current directory 
See sys/user.h. How does this interact with "cd"?


In 8.0, all of the above move to the proc table entry. 
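
To make "a file descriptor is just an index into this array" concrete,
here is a rough sketch. Everything in it is hypothetical (the toy_ names,
the tiny table sizes, the fields); the real declarations are in
sys/user.h and sys/file.h and look different.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NOFILE 8    /* made-up per-process limit */
#define TOY_NFILE  16   /* stands in for the tunable "nfile" */

struct toy_inode { int i_number; };

/* One entry of the system-wide open file table. */
struct toy_file {
    int               f_count;   /* 0 means the slot is free */
    long              f_offset;  /* current read/write offset */
    struct toy_inode *f_inode;   /* what this open file refers to */
};

static struct toy_file toy_file_table[TOY_NFILE];

/* Per-process part: u_ofile is an array of pointers into the file table. */
struct toy_user {
    struct toy_file *u_ofile[TOY_NOFILE];
};

/* open: take the lowest free u_ofile slot, then a free file table slot. */
static int toy_open(struct toy_user *u, struct toy_inode *ip)
{
    int fd, i;

    assert(u != NULL && ip != NULL);
    for (fd = 0; fd < TOY_NOFILE && u->u_ofile[fd] != NULL; fd++)
        ;
    if (fd == TOY_NOFILE)
        return -1;                    /* this process has no free slot */
    for (i = 0; i < TOY_NFILE; i++) {
        if (toy_file_table[i].f_count == 0) {
            toy_file_table[i].f_count = 1;
            toy_file_table[i].f_offset = 0;
            toy_file_table[i].f_inode = ip;
            u->u_ofile[fd] = &toy_file_table[i];
            return fd;                /* the descriptor is just the index */
        }
    }
    return -1;                        /* system-wide file table is full */
}

/* Resolve a descriptor to its file table entry, or NULL if invalid. */
static struct toy_file *toy_getf(struct toy_user *u, int fd)
{
    if (fd < 0 || fd >= TOY_NOFILE)
        return NULL;
    return u->u_ofile[fd];
}
```

Opening the same file twice gives two descriptors and two file table
entries (each with its own offset) that refer to the same inode.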


System wide: 


>> 


file - the kernel open file table. There is at least 
one slot in it for each file or socket that is open, 
and it is sized by the tunable parameter "nfile". 

See sys/file.h. 


inode - the inode cache. There is a slot in it for 
each inode that is in core (remember that we do caching, 
so a given in-core inode isn’t necessarily being used), 
and it is sized by (all together now :-)) "ninode". 
Every file that is open on a local HFS disk *must* have 
*one* slot in the inode table. See sys/inode.h and 
sys/ino.h. 


ncache - the directory name lookup cache, also sized by 
"ninode". 


In 8.0: fs_async decides whether the filesystem should 
lean toward reliability or performance. If it is set 
to 0 (default), the system will write inodes/blocks to 
disk more often, which gives reliability at the expense 
of performance. If it is 1, the system will delay 
these writes, yielding a great deal of performance 
in some situations and very little in others. 


Having it set to 1 pretty much guarantees having to do 
a manual fsck(1m) if the system crashes or loses power. << 
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Relevant Kernel Structures 


proc              u area              file              inode 

[Figure: u_procp in the u area points back at the process’s proc table 
entry; each in-use u_ofile slot in the u area points at an entry in the 
file table, and the file table entries lead on to in-core inodes.] 


- an in-core inode looks something like this: 


+-----------------+ 
|      vnode      | 
+-----------------+ 
|     on-disk     | 
|      inode      | 
+-----------------+ 
       inode 



CD-ROM Layout 


Our CD-ROM support conforms to the High Sierra & ISO-9660 
standards. 


Here’s a rough sketch of how a CD-ROM is organized: 


Area (size)                        Contents 
---------------------------------  ----------------------------------- 
System Area                        Contents not specified by standard 
  (16 sectors = 32 kbytes) 

Primary Volume Descriptor          Descriptor for 1st volume 
  (2 kbytes) 

Supplementary Volume Descriptor    Descriptors for additional volumes 
  (2 kbytes) 

Volume Descriptor Set Terminator   Piece of data marking end of 
  (2 kbytes)                       volume descriptors 

  ...                              Potential empty space 

Path Tables                        Potentially four path tables: the 
                                   two required M & L tables, plus two 
                                   additional optional M & L tables 

  ...                              Potential empty space 

                                   Root directory for first volume 

                                   Data (files) for first volume 

  ...                              Potential empty space 

                                   Root directory for next volume 

                                   Data (files) for next volume 

  ...                              till end of CDROM or data 
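
The sizes in the sketch translate directly into byte offsets: the system
area is 16 sectors of 2 kbytes, so the first volume descriptor starts at
byte 32768, and each descriptor after it is one more sector along. The
code below is a sketch only; it uses the descriptor type codes and the
"CD001" identifier from ISO-9660, does not handle the High Sierra form,
and the function names are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CD_SECTOR_SIZE   2048   /* 2 kbytes per logical sector */
#define CD_SYSTEM_AREA   16     /* sectors reserved at the start */

#define VD_PRIMARY        1     /* ISO-9660 volume descriptor types */
#define VD_SUPPLEMENTARY  2
#define VD_TERMINATOR     255

/* Byte offset of the n'th volume descriptor (n = 0 is the first). */
static long vd_offset(int n)
{
    assert(n >= 0);
    return (long)(CD_SYSTEM_AREA + n) * CD_SECTOR_SIZE;
}

/*
 * Count the descriptors in an in-memory disc image that come before
 * the set terminator.  Returns -1 if the image ends too early or a
 * descriptor lacks the ISO-9660 "CD001" identifier.
 */
static int count_volume_descriptors(const unsigned char *img, size_t len)
{
    int n;

    for (n = 0; ; n++) {
        size_t off = (size_t)vd_offset(n);

        if (off + CD_SECTOR_SIZE > len)
            return -1;
        if (memcmp(img + off + 1, "CD001", 5) != 0)
            return -1;
        if (img[off] == VD_TERMINATOR)
            return n;
    }
}
```

A disc with one primary and one supplementary descriptor followed by the
terminator gives a count of 2.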
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The Berkeley/McKusick file system 


- often referred to as "HFS" or "ufs" 


- retains advantages of the original Bell design 


- includes remedies for most problem areas 


* throughput: larger block size (4/8 Kbytes) 


* locality: introduction of "cylinder groups" 
(each resembles a Bell file system) 


* robustness: superblock is replicated in each group 


* extensible: can access files of 4+ Gbytes 
(theoretical maximum ~ 4 Tbytes) 


- see fs(4) for an explanation of many of the fields 
in the superblock 


- minfree is a space-for-time tradeoff; the filesystem wastes 
some space in order to make block allocation stable and fast; 
note that it is a *percentage*, not a fixed amount (yes, this 
is still true on 1.3GB disks....) 


- a cylinder group contains a backup copy of the superblock, 
a cylinder group block, some inodes, and some data 


the information that changes in the superblock is 
the kind of thing fsck(1m) can fix, so once the 
filesystem is built the redundant superblocks are 
not normally updated (convertfs(1m) is the most 
common exception) 


there is a fixed number of inodes per cylinder group 


the information about which blocks/inodes are free 
is in bit maps in the cylinder group blocks, 
cg_free[] & cg_iused[] 


the last time CGB was written is stored in cg_time, 
which is helpful to know when trying to "un-rm" 
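
Because minfree is a percentage, the amount held back grows with the
disk. The sketch below follows the shape of the freespace() macro in
fs.h, but struct toy_fs is a made-up, cut-down stand-in for the
superblock, and it assumes the data size is counted in fragments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down superblock: only what this sketch needs. */
struct toy_fs {
    long fs_dsize;    /* data area size, assumed to be in fragments */
    long fs_frag;     /* fragments per block */
    long fs_minfree;  /* minimum percentage of free space */
    long cs_nbfree;   /* free whole blocks */
    long cs_nffree;   /* free fragments not part of a free block */
};

/* Fragments held back from ordinary users by minfree. */
static long toy_reserved(const struct toy_fs *fs)
{
    assert(fs != NULL);
    return fs->fs_dsize * fs->fs_minfree / 100;
}

/* Fragments an ordinary user may still allocate; <= 0 means "full". */
static long toy_freespace(const struct toy_fs *fs)
{
    long free_frags = fs->cs_nbfree * fs->fs_frag + fs->cs_nffree;

    return free_frags - toy_reserved(fs);
}
```

With 1 Kbyte fragments, a data area of 1,300,000 fragments and minfree
at 10, the reserve is 130,000 fragments, or roughly 130 Mbytes.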
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Picture of a Berkeley file system 


cylinder group 0: 


| BB | SB | SB | CGB | I-n | DB                          | 
                             ^ 
                             \-- CG summary info 


cylinder group 1: 


| DB | SB | CGB | I-n | DB                               | 


cylinder group 2: 


| DB      | SB | CGB | I-n | DB                          | 


Note that the groups are "walking" to the right - this is 
because the system tries to stagger the backup superblocks 
*all over* the disk. Given this staggering of the CG 
beginnings, it would be hard to find the inodes or CGB 
or backup SB for any particular CG, except that there 
are macros that will do it for us. 


cgsblock(&sb, 5) will return the fragment address 
of the beginning of the superblock 
stored in CG 5 


cgimin(&sb, 21) will return the fragment address 
of the first inode in CG 21 
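
A quick way to see what these macros do is to run them on a made-up
superblock. The macro bodies below follow the cylinder group macros in
fs.h; struct toy_fs, toy_daddr_t, sb_byte_offset() and every number in
the usage note are assumptions for illustration, not values read from a
real disk.

```c
#include <assert.h>
#include <stddef.h>

typedef long toy_daddr_t;    /* stand-in for the kernel's daddr_t */

/* Only the superblock fields the macros touch. */
struct toy_fs {
    toy_daddr_t fs_sblkno;   /* offset of super-block in a group */
    toy_daddr_t fs_cblkno;   /* offset of cylinder group block */
    toy_daddr_t fs_iblkno;   /* offset of inode blocks */
    toy_daddr_t fs_dblkno;   /* offset of first data after cg */
    long fs_cgoffset;        /* cylinder group offset in cylinder */
    long fs_cgmask;          /* used to calc mod fs_ntrak */
    long fs_fpg;             /* fragments per group */
    long fs_fsize;           /* fragment size in bytes */
};

#define cgbase(fs, c)   ((toy_daddr_t)((fs)->fs_fpg * (c)))
#define cgstart(fs, c) \
    (cgbase(fs, c) + (fs)->fs_cgoffset * ((c) & ~((fs)->fs_cgmask)))
#define cgsblock(fs, c) (cgstart(fs, c) + (fs)->fs_sblkno)  /* super blk */
#define cgtod(fs, c)    (cgstart(fs, c) + (fs)->fs_cblkno)  /* cg block */
#define cgimin(fs, c)   (cgstart(fs, c) + (fs)->fs_iblkno)  /* inode blk */
#define cgdmin(fs, c)   (cgstart(fs, c) + (fs)->fs_dblkno)  /* 1st data */

/* Byte offset on the disk of the backup superblock kept in group c. */
static long sb_byte_offset(const struct toy_fs *sb, int c)
{
    assert(sb != NULL && c >= 0);
    return (long)cgsblock(sb, c) * sb->fs_fsize;
}
```

With 1 Kbyte fragments, 2048 fragments per group, a group offset of 8
and offsets of 16/24/32/56 for the superblock, CGB, inodes and data, the
backup superblock of CG 1 comes out at fragment 2072 and the one in CG 2
at fragment 4128: each group's copy sits a little further into its group
than the last.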
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The space on a disk really comes from sectors that are organized into 
tracks that are organized into surfaces/platters.... However, it is 
easier to think about it in terms of a flat logical address space 
(which is the interface modern disks present): 


         boot block           <--- both this and the primary SB are 
         primary superblock        CG 0 "data"; they just don’t belong 
                                   to any particular file.... 
         CG 0 superblock      <--- cgsblock(&super, 0)*super.fs_fsize 
                                   or SBLOCK*DEV_BSIZE 
         CG 0 cgblock 
         CG 0 inodes          <--- cgimin(&super, 0)*super.fs_fsize 
         CG 0 inodes 
         CG 0 inodes 
         CG 0 data                 (rest of CG 0 is data) 
             | 
             V 
2048K    CG 1 data            <--- cgbase(&super, 1)*super.fs_fsize 
2056K    CG 1 data                 Notice this data in front of CG 1’s 
2064K    CG 1 data                 superblock - CG 2 would have even more 
                                   of it - this is to scatter superblocks 
                                   all over the disk. 
2072K    CG 1 superblock      <--- cgsblock(&super, 1)*super.fs_fsize 
2080K    CG 1 cgblock 
2088K    CG 1 inodes          <--- cgimin(&super, 1)*super.fs_fsize 
2096K    CG 1 inodes 
2104K    CG 1 inodes 
2112K    CG 1 data 




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How UFS Files Are Accessed 


(The following notes assume no non-UFS elements in the path) 


Directories contain i-number, record length, name length, 
and filename (the record length is in there so that 
deletions can be handled simply - we just add the record 
length of the entry being zapped to the previous one). 
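
The "add the record length to the previous one" trick can be sketched as
follows. This is an illustration, not the kernel's directory code:
struct dir_hdr and the dir_ functions are made up, and the real record
layout is the one in ndir.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed part of a directory record; the name follows it in the block. */
struct dir_hdr {
    uint32_t d_ino;      /* i-number, 0 = unused slot */
    uint16_t d_reclen;   /* distance in bytes to the next record */
    uint16_t d_namlen;   /* length of the name */
};

/* Write one record at byte offset off (helper for building examples). */
static void dir_put(unsigned char *blk, size_t off, uint32_t ino,
                    const char *name, uint16_t reclen)
{
    struct dir_hdr h;

    h.d_ino = ino;
    h.d_reclen = reclen;
    h.d_namlen = (uint16_t)strlen(name);
    assert(sizeof h + h.d_namlen + 1u <= reclen);
    memcpy(blk + off, &h, sizeof h);
    memcpy(blk + off + sizeof h, name, h.d_namlen + 1u);
}

/* Read back the record length stored at byte offset off. */
static uint16_t dir_reclen(const unsigned char *blk, size_t off)
{
    struct dir_hdr h;

    memcpy(&h, blk + off, sizeof h);
    return h.d_reclen;
}

/* Remove name from the block; 0 on success, -1 if it is not there. */
static int dir_remove(unsigned char *blk, size_t len, const char *name)
{
    size_t off = 0, prev = 0;
    struct dir_hdr h, p;
    int have_prev = 0;

    while (off + sizeof h <= len) {
        memcpy(&h, blk + off, sizeof h);
        if (h.d_reclen == 0)
            return -1;                      /* corrupt block */
        if (h.d_ino != 0 && h.d_namlen == strlen(name) &&
            memcmp(blk + off + sizeof h, name, h.d_namlen) == 0) {
            if (have_prev) {
                memcpy(&p, blk + prev, sizeof p);
                p.d_reclen += h.d_reclen;   /* previous entry swallows it */
                memcpy(blk + prev, &p, sizeof p);
            } else {
                h.d_ino = 0;                /* first entry: mark it unused */
                memcpy(blk + off, &h, sizeof h);
            }
            return 0;
        }
        have_prev = 1;
        prev = off;
        off += h.d_reclen;
    }
    return -1;
}
```

Nothing is moved or erased: the zapped entry's bytes are still in the
block, they are just skipped because the previous record now reaches
past them.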


Root directory is called "/" and its i-number is always 2, 
which is why we need both a device and an i-number to uniquely 
identify a file. 


The inode has things like modification/access time stamps, 
modes, uid/gid, etc, as well as pointers to the actual 
data blocks. The structure of an inode is defined in 
/usr/include/sys/ino.h. 


To find a file, the kernel must start from the current directory 
or the root (depending on whether the name starts with "/") and 
go through a directory and an inode per element of the path. 


The directory is the *only* place on the disk where the filename 
is stored; the inode has everything else about the file. 


Normally, directories should be read with opendir(3)/readdir(3); 
when you are reading them straight from the disk, though, be sure 
to use the structure defined in /usr/include/ndir.h. 


  inum   rlen   nlen   name 
+------+------+------+------------+ 
|  2   |  12  |  1   | .          | 
|  2   |  12  |  2   | ..         | 
|  3   |  20  |  10  | lost+found | 
+------+------+------+------------+ 
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Pathname lookup 


- To use a path like "/users/se/smith", the kernel must translate 
it to an i-number (or cd-number, etc.) To do this, it chops 
the path up into individual names and lets the appropriate 
filesystem code handle looking for the next name in that one 
(assuming it’s a directory; if it’s not, we must be done or 
else the user goofed). 
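
The "chops the path up into individual names" step might look roughly
like this. It is only a sketch of the chopping; next_component() is a
made-up helper, and the real code also has to cope with symbolic links,
mount points and handing each name to the right filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the next name from *pathp into buf and advance *pathp past it.
 * Returns 1 if a name was produced, 0 at the end of the path, and
 * -1 if the name does not fit in buf.
 */
static int next_component(const char **pathp, char *buf, size_t buflen)
{
    const char *p;
    size_t n;

    assert(pathp != NULL && *pathp != NULL && buf != NULL);
    p = *pathp;
    while (*p == '/')          /* skip one or more separators */
        p++;
    if (*p == '\0') {
        *pathp = p;
        return 0;
    }
    n = strcspn(p, "/");       /* length of this name */
    if (n + 1 > buflen)
        return -1;
    memcpy(buf, p, n);
    buf[n] = '\0';
    *pathp = p + n;
    return 1;
}
```

Called repeatedly on "/users/se/smith" it yields "users", "se" and
"smith", and then reports the end of the path.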


The McKusick filesystem staggers backup superblocks around the disk, 
and tries to put a file’s data, directory entry, and inode close 
together: 


cylinder group 0 
| BB | SB | SB | CGB | I-n | DB                          | 
cylinder group 1 
| DB | SB | CGB | I-n | DB                               | 
cylinder group 2 
| DB      | SB | CGB | I-n | DB                          | 


On-disk data structures 


- The superblock has fundamental information about the whole 
filesystem: the block/fragment size, the number of cylinder 
groups, the magic number, etc. 


- All of the interesting information about a file (except its 
name) is in its inode. 


- The actual block pointers for the file’s data are expressed 
as fragment addresses and are found in the inode. There are 
12 direct-block pointers and 3 indirect block pointers. The 
1st indirect-block pointer points at a block of pointers to 
real disk blocks; the 2nd points to a block of pointers to 
blocks of pointers to real data; the 3rd is presently unused :-) 
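
To see which pointer a given logical block goes through, and why two
levels of indirection are enough for 4+ Gbyte files, here is a small
sketch. NDADDR and NIADDR are the values from inode.h; the two functions
are made up for illustration, and the size calculation assumes 4-byte
block pointers.

```c
#include <assert.h>
#include <stddef.h>

#define NDADDR 12   /* direct addresses in inode */
#define NIADDR 3    /* indirect addresses in inode */

/*
 * How many levels of indirect block must be read to reach logical
 * block lbn: 0 = direct pointer, 1 = single indirect, and so on.
 * Returns -1 if lbn is beyond what NIADDR levels can address.
 * nindir is the number of block pointers that fit in one block.
 */
static int indir_level(long long lbn, long long nindir)
{
    long long span = 1;   /* blocks reachable through one pointer */
    int level;

    assert(lbn >= 0 && nindir > 0);
    if (lbn < NDADDR)
        return 0;
    lbn -= NDADDR;
    for (level = 1; level <= NIADDR; level++) {
        span *= nindir;
        if (lbn < span)
            return level;
        lbn -= span;
    }
    return -1;
}

/* Largest file, in bytes, reachable with the direct pointers plus
 * the given number of indirect levels. */
static long long max_bytes(long long bsize, int levels)
{
    long long nindir = bsize / 4;   /* assumes 4-byte block pointers */
    long long blocks = NDADDR, span = 1;
    int i;

    for (i = 0; i < levels; i++) {
        span *= nindir;
        blocks += span;
    }
    return blocks * bsize;
}
```

With 4 Kbyte blocks, the direct pointers plus two indirect levels reach
a little over 4 Gbytes, so the 3rd indirect pointer is never needed for
a file whose size fits in 32 bits.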
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/* 


 * dnlc.h: $Revision: 1.3.61.2 $ $Date: 91/06/19 13:45:42 $ 
 * $Locker:  $ 
 */ 

#ifndef _SYS_DNLC_INCLUDED 
#define _SYS_DNLC_INCLUDED 

/* 
 * Copyright (c) 1984 Sun Microsystems Inc. 
 */ 

/* 
 * This structure describes the elements in the cache of recent 
 * names looked up. 
 */ 

#define NC_NAMLEN       15      /* maximum name segment length we bother with */ 

struct ncache { 
        struct ncache   *hash_next, *hash_prev; /* hash chain, MUST BE FIRST */ 
        struct ncache   *lru_next, *lru_prev;   /* LRU chain */ 
        struct vnode    *vp;                    /* vnode the name refers to */ 
        struct vnode    *dp;                    /* vno of parent of name */ 
        char            namlen;                 /* length of name */ 
        char            name[NC_NAMLEN];        /* segment name */ 
        struct ucred    *cred;                  /* credentials */ 
}; 

#define ANYCRED ((struct ucred *) -1) 
#define NOCRED  ((struct ucred *) 0) 

/* 
int     ncsize; 
struct ncache *ncache; 
*/ 

#define NC_HASH_SIZE    256     /* size of hash table */ 

/* 
 * Stats on usefulness of name cache. 
 */ 
struct ncstats { 
        int     hits;           /* hits that we can really use */ 
        int     misses;         /* cache misses */ 
        int     long_enter;     /* long names tried to enter */ 
        int     long_look;      /* long names tried to look up */ 
        int     lru_empty;      /* LRU list empty */ 
        int     purges;         /* number of purges of cache */ 
}; 

/* 
 * Hash list of name cache entries for fast lookup. 
 */ 
struct nc_hash { 
        struct ncache   *hash_next, *hash_prev; 
}; 

#endif /* _SYS_DNLC_INCLUDED */ 
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
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/* @(#)fs.h: $Revision: 1.17.61.2 $ $Date: 91/06/19 15:45:29 $ 
 * $Locker:  $ 
 */ 

/* @(#) $Revision: 1.17.61.2 $ */ 

#ifndef _SYS_FS_INCLUDED /* allows multiple inclusion */ 
#define _SYS_FS_INCLUDED 

/* 
 * Each disk drive contains some number of file systems. 
 * A file system consists of a number of cylinder groups. 
 * Each cylinder group has inodes and data. 
 * 
 * A file system is described by its super-block, which in turn 
 * describes the cylinder groups.  The super-block is critical 
 * data and is replicated in each cylinder group to protect against 
 * catastrophic loss.  This is done at mkfs time and the critical 
 * super-block data does not change, so the copies need not be 
 * referenced further unless disaster strikes. 
 * 
 * For file system fs, the offsets of the various blocks of interest 
 * are given in the super block as: 
 *      [fs->fs_sblkno]         Super-block 
 *      [fs->fs_cblkno]         Cylinder group block 
 *      [fs->fs_iblkno]         Inode blocks 
 *      [fs->fs_dblkno]         Data blocks 
 * The beginning of cylinder group cg in fs, is given by 
 * the ``cgbase(fs, cg)'' macro. 
 * 
 * The first boot and super blocks are given in absolute disk addresses. 
 */ 
#define BBSIZE          8192 
#define SBSIZE          8192 
#define BBLOCK          ((daddr_t)(0)) 
#define SBLOCK          ((daddr_t)(BBLOCK + BBSIZE / DEV_BSIZE)) 

/* 
 * Addresses stored in inodes are capable of addressing fragments 
 * of `blocks'. File system blocks of at most size MAXBSIZE can 
 * be optionally broken into 2, 4, or 8 pieces, each of which is 
 * addressible; these pieces may be DEV_BSIZE, or some multiple of 
 * a DEV_BSIZE unit. 
 * 
 * Large files consist of exclusively large data blocks.  To avoid 
 * undue wasted disk space, the last data block of a small file may be 
 * allocated as only as many fragments of a large block as are 
 * necessary.  The file system format retains only a single pointer 
 * to such a fragment, which is a piece of a single large block that 
 * has been divided.  The size of such a fragment is determinable from 
 * information in the inode, using the ``blksize(fs, ip, lbn)'' macro. 
 * 
 * The file system records space availability at the fragment level; 
 * to determine block availability, aligned fragments are examined. 
 */ 

/* 
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 * Cylinder group related limits. 
 * 
 * For each cylinder we keep track of the availability of blocks at different 
 * rotational positions, so that we can lay out the data to be picked 
 * up with minimum rotational latency.  NRPOS is the number of rotational 
 * positions which we distinguish.  With NRPOS 8 the resolution of our 
 * summary information is 2ms for a typical 3600 rpm drive. 
 */ 
#define NRPOS           8       /* number distinct rotational positions */ 

/* 
 * MAXIPG bounds the number of inodes per cylinder group, and 
 * is needed only to keep the structure simpler by having the 
 * only a single variable size element (the free bit map). 
 * 
 * N.B.: MAXIPG must be a multiple of INOPB(fs). 
 */ 
#define MAXIPG          2048    /* max number inodes/cyl group */ 

/* 
 * MINBSIZE is the smallest allowable block size. 
 * In order to insure that it is possible to create files of size 
 * 2^32 with only two levels of indirection, MINBSIZE is set to 4096. 
 * MINBSIZE must be big enough to hold a cylinder group block, 
 * thus changes to (struct cg) must keep its size within MINBSIZE. 
 * MAXCPG is limited only to dimension an array in (struct cg); 
 * it can be made larger as long as that structures size remains 
 * within the bounds dictated by MINBSIZE. 
 * Note that super blocks are always of size MAXBSIZE, 
 * and that MAXBSIZE must be >= MINBSIZE. 
 */ 
#define MINBSIZE        4096 
#define MAXCPG          32      /* maximum fs_cpg */ 

/* MAXFRAG is the maximum number of fragments per block */ 
#define MAXFRAG         8 

#ifndef NBBY 
#define NBBY            8       /* number of bits in a byte */ 
                                /* NOTE: this is also defined */ 
                                /* in param.h.  So if NBBY gets */ 
                                /* changed, change it in */ 
                                /* param.h also */ 
#endif 

/* 
 * The path name on which the file system is mounted is maintained 
 * in fs_fsmnt. MAXMNTLEN defines the amount of space allocated in 
 * the super block for this name. 
 * The limit on the amount of summary information per file system 
 * is defined by MAXCSBUFS. It is currently parameterized for a 
 * maximum of two million cylinders. 
 */ 
#define MAXMNTLEN 512 
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#define MAXCSBUFS 32 

/* 
 * Per cylinder group information; summarized in blocks allocated 
 * from first cylinder group data blocks.  These blocks have to be 
 * read in from fs_csaddr (size fs_cssize) in addition to the 
 * super block. 
 * 
 * N.B. sizeof (struct csum) must be a power of two in order for 
 * the ``fs_cs'' macro to work (see below). 
 */ 
struct csum { 
        long    cs_ndir;        /* number of directories */ 
        long    cs_nbfree;      /* number of free blocks */ 
        long    cs_nifree;      /* number of free inodes */ 
        long    cs_nffree;      /* number of free frags */ 
}; 

/* 
 * Super block for a file system. 
 */ 
#define FS_MAGIC        0x011954 

/* 
 * Magic number for file system allowing long file names. 
 */ 
#define FS_MAGIC_LFN    0x095014 

/* 
 * Magic number for file systems which have their fs_featurebits field 
 * set up. 
 */ 
#define FD_FSMAGIC      0x195612 

/* 
 * Flags for fs_featurebits field. 
 */ 
#define FSF_LFN         0x1     /* long file names */ 
#define FSF_KNOWN       (FSF_LFN) 
#define FSF_UNKNOWN(bits)       ((bits) & ~(FSF_KNOWN)) 

/* 
 * Quick check to see if inode is in a file system allowing 
 * long file names. 
 */ 
#define IS_LFN_FS(ip) \ 
        (((ip)->i_fs->fs_magic == FS_MAGIC_LFN) || \ 
         ((ip)->i_fs->fs_featurebits & FSF_LFN)) 

#define FS_CLEAN        0x17 
#define FS_OK           0x53 
#define FS_NOTOK        0x31 

/* fs_flags fields */ 
#define FS_INSTALL      0x80 
#define FS_QCLEAN       0x01 
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#define FS_QOK          0x02 
#define FS_QNOTOK       0x03 
#define FS_QMASK        0x03 
#define FS_QFLAG(p)     ((p)->fs_flags & FS_QMASK) 
#define FS_QSET(p,val)  ((p)->fs_flags &= ~FS_QMASK, (p)->fs_flags |= (val)) 

/* Mirstate describes the mirror states of the root and primary swap */ 
/* devices.  This information is only recorded in the super block of */ 
/* the root file system.  If root and swap devices are mirrored, the */ 
/* bootup code will configure their states based on mirstate.        */ 

struct mirinfo { 
        struct mirstate {               /* mirror states for root and swap */ 
                u_int   root:4,         /* root mirror states */ 
                        rflag:1,        /* root clean/unconf flag */ 
                        swap:4,         /* swap mirror states */ 
                        sflag:1,        /* swap clean/unconf flag */ 
                        spare:22;       /* spare bits */ 
        } state; 
        long    mirtime;                /* mirror time stamp */ 
}; 

struct fs 
{ 
        struct  fs *fs_link;            /* linked list of file systems */ 
        struct  fs *fs_rlink;           /*     used for incore super blocks */ 
        daddr_t fs_sblkno;              /* addr of super-block in filesys */ 
        daddr_t fs_cblkno;              /* offset of cyl-block in filesys */ 
        daddr_t fs_iblkno;              /* offset of inode-blocks in filesys */ 
        daddr_t fs_dblkno;              /* offset of first data after cg */ 
        long    fs_cgoffset;            /* cylinder group offset in cylinder */ 
        long    fs_cgmask;              /* used to calc mod fs_ntrak */ 
        time_t  fs_time;                /* last time written */ 
        long    fs_size;                /* number of blocks in fs */ 
        long    fs_dsize;               /* number of data blocks in fs */ 
        long    fs_ncg;                 /* number of cylinder groups */ 
        long    fs_bsize;               /* size of basic blocks in fs */ 
        long    fs_fsize;               /* size of frag blocks in fs */ 
        long    fs_frag;                /* number of frags in a block in fs */ 
/* these are configuration parameters */ 
        long    fs_minfree;             /* minimum percentage of free blocks */ 
        long    fs_rotdelay;            /* num of ms for optimal next block */ 
        long    fs_rps;                 /* disk revolutions per second */ 
/* these fields can be computed from the others */ 
        long    fs_bmask;               /* ``blkoff'' calc of blk offsets */ 
        long    fs_fmask;               /* ``fragoff'' calc of frag offsets */ 
        long    fs_bshift;              /* ``lblkno'' calc of logical blkno */ 
        long    fs_fshift;              /* ``numfrags'' calc number of frags */ 
/* these are configuration parameters */ 
        long    fs_maxcontig;           /* max number of contiguous blks */ 
        long    fs_maxbpg;              /* max number of blks per cyl group */ 
/* these fields can be computed from the others */ 
        long    fs_fragshift;           /* block to frag shift */ 
        long    fs_fsbtodb;             /* fsbtodb and dbtofsb shift constant */ 
        long    fs_sbsize;              /* actual size of super block */ 
        long    fs_csmask;              /* csum block offset */ 
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        long    fs_csshift;             /* csum block number */ 
        long    fs_nindir;              /* value of NINDIR */ 
        long    fs_inopb;               /* value of INOPB */ 
        long    fs_nspf;                /* value of NSPF */ 
        long    fs_id[2];               /* file system id */ 
        struct  mirinfo fs_mirror;      /* mirror states of root/swap */ 
        long    fs_featurebits;         /* feature bit flags */ 
        long    fs_optim;               /* optimization preference - see below */ 
/* sizes determined by number of cylinder groups and their sizes */ 
        daddr_t fs_csaddr;              /* blk addr of cyl grp summary area */ 
        long    fs_cssize;              /* size of cyl grp summary area */ 
        long    fs_cgsize;              /* cylinder group size */ 
/* these fields should be derived from the hardware */ 
        long    fs_ntrak;               /* tracks per cylinder */ 
        long    fs_nsect;               /* sectors per track */ 
        long    fs_spc;                 /* sectors per cylinder */ 
/* this comes from the disk driver partitioning */ 
        long    fs_ncyl;                /* cylinders in file system */ 
/* these fields can be computed from the others */ 
        long    fs_cpg;                 /* cylinders per group */ 
        long    fs_ipg;                 /* inodes per group */ 
        long    fs_fpg;                 /* blocks per group * fs_frag */ 
/* this data must be re-computed after crashes */ 
        struct  csum fs_cstotal;        /* cylinder summary information */ 
/* these fields are cleared at mount time */ 
        char    fs_fmod;                /* super block modified flag */ 
        char    fs_clean;               /* file system is clean flag */ 
        char    fs_ronly;               /* mounted read-only flag */ 
        char    fs_flags;               /* currently unused flag */ 
        char    fs_fsmnt[MAXMNTLEN];    /* name mounted on */ 
/* these fields retain the current block allocation info */ 
        long    fs_cgrotor;             /* last cg searched */ 
        struct  csum *fs_csp[MAXCSBUFS];/* list of fs_cs info buffers */ 
        long    fs_cpc;                 /* cyl per cycle in postbl */ 
        short   fs_postbl[MAXCPG][NRPOS];/* head of blocks for each rotation */ 
        long    fs_magic;               /* magic number */ 
        char    fs_fname[6];            /* file system name */ 
        char    fs_fpack[6];            /* file system pack name */ 
        u_char  fs_rotbl[1];            /* list of blocks for each rotation */ 
/* actually longer */ 
}; 

/* 
 * Preference for optimization. 
 */ 
#define FS_OPTTIME      0       /* minimize allocation time */ 
#define FS_OPTSPACE     1       /* minimize disk fragmentation */ 

/* 
 * Convert cylinder group to base address of its global summary info. 
 * 
 * N.B. This macro assumes that sizeof(struct csum) is a power of two. 
 */ 
#define fs_cs(fs, indx) \ 
        fs_csp[(indx) >> (fs)->fs_csshift][(indx) & ~(fs)->fs_csmask] 

/* 
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 * MAXBPC bounds the size of the rotational layout tables and 
 * is limited by the fact that the super block is of size SBSIZE. 
 * The size of these tables is INVERSELY proportional to the block 
 * size of the file system. It is aggravated by sector sizes that 
 * are not powers of two, as this increases the number of cylinders 
 * included before the rotational pattern repeats (fs_cpc). 
 * Its size is derived from the number of bytes remaining in (struct fs) 
 */ 
#define MAXBPC  (SBSIZE - sizeof (struct fs)) 

/* 
 * Cylinder group block for a file system. 
 */ 
#define CG_MAGIC        0x090255 
struct cg { 
        struct  cg *cg_link;            /* linked list of cyl groups */ 
        struct  cg *cg_rlink;           /*     used for incore cyl groups */ 
        time_t  cg_time;                /* time last written */ 
        long    cg_cgx;                 /* we are the cgx'th cylinder group */ 
        short   cg_ncyl;                /* number of cyl's this cg */ 
        short   cg_niblk;               /* number of inode blocks this cg */ 
        long    cg_ndblk;               /* number of data blocks this cg */ 
        struct  csum cg_cs;             /* cylinder summary information */ 
        long    cg_rotor;               /* position of last used block */ 
        long    cg_frotor;              /* position of last used frag */ 
        long    cg_irotor;              /* position of last used inode */ 
        long    cg_frsum[MAXFRAG];      /* counts of available frags */ 
        long    cg_btot[MAXCPG];        /* block totals per cylinder */ 
        short   cg_b[MAXCPG][NRPOS];    /* positions of free blocks */ 
        char    cg_iused[MAXIPG/NBBY];  /* used inode map */ 
        long    cg_magic;               /* magic number */ 
        u_char  cg_free[1];             /* free block map */ 
/* actually longer */ 
}; 

/* 
 * MAXBPG bounds the number of blocks of data per cylinder group, 
 * and is limited by the fact that cylinder groups are at most one block. 
 * Its size is derived from the size of blocks and the (struct cg) size, 
 * by the number of remaining bits. 
 */ 
#define MAXBPG(fs) \ 
        (fragstoblks((fs), (NBBY * ((fs)->fs_bsize - (sizeof (struct cg)))))) 

/* 
 * Turn file system block numbers into disk block addresses. 
 * This maps file system blocks to device size blocks. 
 */ 
#define fsbtodb(fs, b)  ((b) << (fs)->fs_fsbtodb) 
#define dbtofsb(fs, b)  ((b) >> (fs)->fs_fsbtodb) 

/* 
 * Cylinder group macros to locate things in cylinder groups. 
 * They calc file system addresses of cylinder group data structures. 
 */ 
#define cgbase(fs, c)   ((daddr_t)((fs)->fs_fpg * (c))) 
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#define cgstart(fs, c) \ 
        (cgbase(fs, c) + (fs)->fs_cgoffset * ((c) & ~((fs)->fs_cgmask))) 
#define cgsblock(fs, c) (cgstart(fs, c) + (fs)->fs_sblkno)      /* super blk */ 
#define cgtod(fs, c)    (cgstart(fs, c) + (fs)->fs_cblkno)      /* cg block */ 
#define cgimin(fs, c)   (cgstart(fs, c) + (fs)->fs_iblkno)      /* inode blk */ 
#define cgdmin(fs, c)   (cgstart(fs, c) + (fs)->fs_dblkno)      /* 1st data */ 

/* 
 * Give cylinder group number for a file system block. 
 * Give cylinder group block number for a file system block. 
 */ 
#define dtog(fs, d)     ((d) / (fs)->fs_fpg) 
#define dtogd(fs, d)    ((d) % (fs)->fs_fpg) 

/* 
 * Extract the bits for a block from a map. 
 * Compute the cylinder and rotational position of a cyl block addr. 
 */ 
#define blkmap(fs, map, loc) \ 
    (((map)[(loc) / NBBY] >> ((loc) & (NBBY-1))) & (0xff >> (NBBY - (fs)->fs_frag))) 
#define cbtocylno(fs, bno) \ 
        ((bno) * NSPF(fs) / (fs)->fs_spc) 
#define cbtorpos(fs, bno) \ 
        ((bno) * NSPF(fs) % (fs)->fs_nsect * NRPOS / (fs)->fs_nsect) 

/* 
 * The following macros optimize certain frequently calculated 
 * quantities by using shifts and masks in place of divisions 
 * modulos and multiplications. 
 */ 
#define blkoff(fs, loc)         /* calculates (loc % fs->fs_bsize) */ \ 
        ((loc) & ~(fs)->fs_bmask) 
#define fragoff(fs, loc)        /* calculates (loc % fs->fs_fsize) */ \ 
        ((loc) & ~(fs)->fs_fmask) 
#define lblkno(fs, loc)         /* calculates (loc / fs->fs_bsize) */ \ 
        ((loc) >> (fs)->fs_bshift) 
#define numfrags(fs, loc)       /* calculates (loc / fs->fs_fsize) */ \ 
        ((loc) >> (fs)->fs_fshift) 
#define blkroundup(fs, size)    /* calculates roundup(size, fs->fs_bsize) */ \ 
        (((size) + (fs)->fs_bsize - 1) & (fs)->fs_bmask) 
#define fragroundup(fs, size)   /* calculates roundup(size, fs->fs_fsize) */ \ 
        (((size) + (fs)->fs_fsize - 1) & (fs)->fs_fmask) 
#define fragstoblks(fs, frags)  /* calculates (frags / fs->fs_frag) */ \ 
        ((frags) >> (fs)->fs_fragshift) 
#define blkstofrags(fs, blks)   /* calculates (blks * fs->fs_frag) */ \ 
        ((blks) << (fs)->fs_fragshift) 
#define fragnum(fs, fsb)        /* calculates (fsb % fs->fs_frag) */ \ 
        ((fsb) & ((fs)->fs_frag - 1)) 
#define blknum(fs, fsb)         /* calculates rounddown(fsb, fs->fs_frag) */ \ 
        ((fsb) &~ ((fs)->fs_frag - 1)) 

/* 
 * Determine the number of available frags given a 
 * percentage to hold in reserve 
 */ 
#define freespace(fs, percentreserved) \ 
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        (blkstofrags((fs), (fs)->fs_cstotal.cs_nbfree) + \ 
        (fs)->fs_cstotal.cs_nffree - ((fs)->fs_dsize * (percentreserved) / 100)) 

/* 
 * Determining the size of a file block in the file system. 
 */ 
#define blksize(fs, ip, lbn) \ 
        (((lbn) >= NDADDR || (ip)->i_size >= ((lbn) + 1) << (fs)->fs_bshift) \ 
            ? (fs)->fs_bsize \ 
            : (fragroundup(fs, blkoff(fs, (ip)->i_size)))) 
#define dblksize(fs, dip, lbn) \ 
        (((lbn) >= NDADDR || (dip)->di_size >= ((lbn) + 1) << (fs)->fs_bshift) \ 
            ? (fs)->fs_bsize \ 
            : (fragroundup(fs, blkoff(fs, (dip)->di_size)))) 

/* 
 * Number of disk sectors per block; assumes DEV_BSIZE byte sector size. 
 */ 
#define NSPB(fs)        ((fs)->fs_nspf << (fs)->fs_fragshift) 
#define NSPF(fs)        ((fs)->fs_nspf) 

#endif /* not _SYS_FS_INCLUDED */ 
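
To get a feel for the shift-and-mask macros and for blksize(), it helps
to run them on a made-up file. The macro bodies below are the ones from
the fs.h listing; struct toy_fs, struct toy_inode and toy_fs_init() are
hypothetical stand-ins that carry only the fields the macros touch.

```c
#include <assert.h>
#include <stddef.h>

#define NDADDR 12   /* direct addresses in inode */

/* Cut-down stand-ins for struct fs and the in-core inode. */
struct toy_fs {
    long fs_bsize, fs_fsize;     /* block and fragment size in bytes */
    long fs_bmask, fs_fmask;     /* ~(size - 1) for each */
    long fs_bshift, fs_fshift;   /* log2 of each size */
};
struct toy_inode { long i_size; };

#define blkoff(fs, loc)       ((loc) & ~(fs)->fs_bmask)
#define lblkno(fs, loc)       ((loc) >> (fs)->fs_bshift)
#define numfrags(fs, loc)     ((loc) >> (fs)->fs_fshift)
#define fragroundup(fs, size) (((size) + (fs)->fs_fsize - 1) & (fs)->fs_fmask)
#define blksize(fs, ip, lbn) \
    (((lbn) >= NDADDR || (ip)->i_size >= ((lbn) + 1) << (fs)->fs_bshift) \
        ? (fs)->fs_bsize \
        : (fragroundup(fs, blkoff(fs, (ip)->i_size))))

/* Fill in the derived fields; both sizes must be powers of two. */
static void toy_fs_init(struct toy_fs *fs, long bsize, long fsize)
{
    assert(fs != NULL);
    assert((bsize & (bsize - 1)) == 0 && (fsize & (fsize - 1)) == 0);
    fs->fs_bsize = bsize;
    fs->fs_bmask = ~(bsize - 1);
    fs->fs_fsize = fsize;
    fs->fs_fmask = ~(fsize - 1);
    for (fs->fs_bshift = 0; (1L << fs->fs_bshift) < bsize; fs->fs_bshift++)
        ;
    for (fs->fs_fshift = 0; (1L << fs->fs_fshift) < fsize; fs->fs_fshift++)
        ;
}
```

For a 20000-byte file on an 8K/1K file system, logical blocks 0 and 1
are full 8192-byte blocks, while the last one (block 2) holds the
remaining 3616 bytes and is rounded up to four 1K fragments.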
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/* @(#) $Revision: 1.14.61.3 $ */ 
/* $Source: /ws_src/sys.UDL_MERGE_800/ufs/RCS/ino.h,v $ 
 * $Revision: 1.14.61.3 $       $Author: rsh $ 
 * $State: Exp $        $Locker:  $ 
 * $Date: 91/11/19 11:21:14 $ 
 */ 

#ifndef _SYS_INO_INCLUDED /* allows multiple inclusion */ 
#define _SYS_INO_INCLUDED 

struct dinode { 
        union { 
                struct  icommon di_icom; 
                char    di_size[128]; 
        } di_un; 
}; 

struct cinode { 
        union { 
                struct  icont ci_icont; 
                char    ci_size[128]; 
        } ci_un; 
}; 

#define di_ic           di_un.di_icom 
#define di_mode         di_ic.ic_mode 
#define di_nlink        di_ic.ic_nlink 
#define di_uid          di_ic.ic_uid 
#define di_gid          di_ic.ic_gid 
#define di_size         di_ic.ic_size.val[1] 
#define di_db           di_ic.ic_un2.ic_reg.ic_db 
#define di_ib           di_ic.ic_un2.ic_reg.ic_un.ic_ib 
#define di_atime        di_ic.ic_atime 
#define di_mtime        di_ic.ic_mtime 
#define di_ctime        di_ic.ic_ctime 
#define di_symlink      di_ic.ic_un2.ic_symlink 
#define di_flags        di_ic.ic_flags 
#define di_rdev         di_ic.ic_un2.ic_reg.ic_db[0] 
#define di_pseudo       di_ic.ic_un2.ic_reg.ic_db[1] 
#define di_rsite        di_ic.ic_un2.ic_reg.ic_db[2] 
#define di_blocks       di_ic.ic_blocks 
#define di_gen          di_ic.ic_gen 
#define di_fversion     di_ic.ic_fversion 
#define di_frptr        di_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_frptr 
#define di_fwptr        di_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fwptr 
#define di_frcnt        di_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_frcnt 
#define di_fwcnt        di_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fwcnt 
#define di_fflag        di_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fflag 
#define di_fifosize     di_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fifosize 
#define di_contin       di_ic.ic_contin 

#define ci_ic           ci_un.ci_icont 
#define ci_mode         ci_ic.icc_mode 
#define ci_nlink        ci_ic.icc_nlink 
#define ci_acl          ci_un.ci_icont.icc_acl 

#endif /* _SYS_INO_INCLUDED */ 
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/* @(#) $Revision: 1.37.61.13 $ */ 
/* $Source: /ws_src/sys.UDL_MERGE_800/ufs/RCS/inode.h,v $ 
 * $Revision: 1.37.61.13 $      $Author: smp $ 
 * $State: Exp $        $Locker:  $ 
 * $Date: 92/05/04 09:28:13 $ 
 */ 

#ifndef _SYS_INODE_INCLUDED /* allows multiple inclusion */ 
#define _SYS_INODE_INCLUDED 

#ifndef _SYS_STDSYMS_INCLUDED 
#  include <sys/stdsyms.h> 
#endif   /* _SYS_STDSYMS_INCLUDED */ 

/* 
 * The I node is the focus of all file activity in UNIX. 
 * There is a unique inode allocated for each active file, 
 * each current directory, each mounted-on file, text file, and the root. 
 * An inode is `named' by its dev/inumber pair. (iget/iget.c) 
 * Data in icommon is read in from permanent inode on volume. 
 */ 
#include <sys/sem_beta.h> 

#ifndef SITEARRAYSIZE 
#include <sys/sitemap.h> 
#endif /* SITEARRAYSIZE */ 

#include <sys/vnode.h> 
#include <sys/acl.h> 

#define NDADDR  12              /* direct addresses in inode */ 
#define NIADDR  3               /* indirect addresses in inode */ 
                                /* fifo's depends on this value */ 
                                /* if this value changes, look */ 
                                /* at icommon.ic_un2.ic_reg.ic_un */ 

/* 
 * Fast symlinks -- 
 *      symbolic links with paths short than MAX_FASTLINK_SIZE 
 *      are stored in the inode where the direct and indirect 
 *      block pointers are normally stored.  The flag IC_FASTLINK 
 *      (in i_flags) indicates that the symbolic link is of the 
 *      "fast" variety. 
 * 
 * This implementation cannot change, or the filesystem will 
 * not be compatible with the OSF/1 "ufs" filesystem. 
 */ 
#define MAX_FASTLINK_SIZE ((NDADDR + NIADDR) * sizeof (daddr_t)) 
#define IC_FASTLINK 0x00000001 

struct inode { 
        struct  inode *i_chain[2];      /* must be first */ 
        struct  vnode i_vnode;  /* vnode associated with this inode */ 
        struct  vnode *i_devvp; /* vnode for block i/o */ 
        u_int   i_flag; 
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#define 
#define 


dev_t i_dev; /* device where inode resides */ 

ino_t i_number ; /* i number, 1-to-1 with device address */ 
int i_diroff; /* offset in dir, where we found last entry 
struct inode *i_contip; /* pointer to the continuation inode */. 
struct fs *i_fs; /* file sys associated with this inode */ 


struct duxfs *i_dfs; 
struct dquot *i_dquot; /* quota structure controlling this file */ 


the i_rdev here so the remote device stuff can change it 
Still have the real device number around 


dev_t i_rdev; /* if special, the device number */ 
union { 
daddr_t if_lastr; /* last read (read-ahead) */ 
struct socket *is_socket; 
} i_un; 
    struct {
        struct inode *if_freef; /* free list forward */
        struct inode **if_freeb; /* free list back */
    } i_fr;


struct i_select { 
struct proc *i_selp; 
short i_selflag; 

} i_fselr, i_fselw; 


struct locklist *i_locklist; /* locked region list */ 

struct sitemap i_opensites; /* map of sites with file open */ 
struct sitemap i_writesites; /* map of sites writing to file */ 
site_t i_ilocksite; /* site holding ilock */ 

    short i_pid; /* pid of last process to lock this inode */
    union
    {
        struct sitemap is_execsites; /* map of sites executing the file */
        struct sitemap is_fifordsites; /* map of sites reading fifo */
    } i_siteu;
#define i_execsites i_siteu.is_execsites
#define i_fifordsites i_siteu.is_fifordsites
    struct dcount i_execdcount; /* # of local process exec the file */
    struct dcount i_refcount; /* real and virtual reference count */
    struct sitemap i_refsites; /* all other references */

    struct mount *i_mount; /* mount table entry
                            * note this can be calculated as:
                            * (struct mount *)
                            * (ITOV(ip)->v_vfsp->v_data)
                            * but since this is a relatively
                            * frequent operation in DUX, we
                            * save it here to make it more
                            * efficient.
                            */
    union
    {
        struct icommon
        {
            u_short ic_mode; /* 0: mode and type of file */
            short ic_nlink; /* 2: number of links to file */
            ushort ic_uid; /* 4: owner’s user id */
            ushort ic_gid; /* 6: owner’s group id */
            quad ic_size; /* 8: number of bytes in file */
#ifdef _KERNEL
            struct timeval ic_atime; /* 16: time last accessed */
            struct timeval ic_mtime; /* 24: time last modified */
            struct timeval ic_ctime; /* 32: last time inode changed */
#else
            time_t ic_atime; /* 16: time last accessed */
            long ic_atspare;
            time_t ic_mtime; /* 24: time last modified */
            long ic_mtspare;
            time_t ic_ctime; /* 32: last time inode changed */
            long ic_ctspare;
#endif /* _KERNEL */
            union {
                struct {
                    daddr_t ic_db[NDADDR]; /* 40: disk block addresses */
                    union {
                        daddr_t ic_ib[NIADDR]; /* 88: indirect blocks */
                        struct ic_fifo
                        {
                            short if_frptr;
                            short if_fwptr;
                            short if_frcnt;
                            short if_fwcnt;
                            short if_fflag;
                            short if_fifosize;
                        } ic_fifo;
                    } ic_un;
                } ic_reg;
                char ic_symlink[MAX_FASTLINK_SIZE]; /* 40: short symlink */
            } ic_un2;
            long ic_flags; /* 100: status */
            long ic_blocks; /* 104: blocks actually held */
            long ic_gen; /* 108: generation number */
            long ic_fversion; /* 112: file version number */
            long ic_spare[2]; /* 116: reserved, currently unused */
            ino_t ic_contin; /* 124: continuation inode number */
        } i_ic;
        struct icont
        {
            ushort icc_mode;
            short icc_nlink; /* 2: number of links to file */
            /* 4: The optional entries of the
             * access control list
             */
#ifdef _KERNEL
            struct acl_tuple icc_acl[NOPTTUPLES];
#else /* not _KERNEL */
            struct acl_entry_internal icc_acl[NOPTENTRIES];
#endif /* else not _KERNEL */
            char icc_spare[46]; /* 82: currently unused */
        } i_icc;
    } i_icun;
#ifdef HPNSE
    struct stdata *i_sptr; /* HP-UX NSE, associated stream */
#endif
    unsigned char i_ord_flags; /* copied to buf for ordered writes */
};
#define L_REMOTE 0x1 /* The process holding the lock is remote */
/* NOTE: Watch out for IWANT = 0x10, which is also used as a lock flag */
#define NFS_WANTS_LOCK 0x2 /* NFS lock manager is waiting for lock */
struct locklist
{
    /* NOTE link must be first in struct */
    struct locklist *ll_link; /* link to next lock region */
    short ll_count; /* reference count */
    short ll_flags; /* current flags: L_REMOTE, IWANT, ILB... */
    union
    {
        struct proc *llu_proc; /* process which owns region */
        struct
        {
            site_t llur_psite; /* Site where process lives */
            short llur_pid; /* PID of process */
        } llu_remote;
    } ll_u;
#define ll_proc ll_u.llu_proc
#define ll_psite ll_u.llu_remote.llur_psite
#define ll_pid ll_u.llu_remote.llur_pid
    off_t ll_start; /* starting offset */
    off_t ll_end; /* ending offset, zero is eof */
    short ll_type; /* type of lock (for fcntl) */
    struct inode *ll_ip; /* Inode owning this locklist */
};

enum lockf_type {L_LOCKF, L_READ, L_WRITE, L_COPEN, L_FCNTL};

#define i_mode i_icun.i_ic.ic_mode 

#define i_nlink i_icun.i_ic.ic_nlink 

#define i_uid i_icun.i_ic.ic_uid 

#define i_gid i_icun.i_ic.ic_gid 

#define i_size i_icun.i_ic.ic_size.val [1] 

#define i_db i_icun.i_ic.ic_un2.ic_reg.ic_db 

#define i_ib i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_ib

#define i_atime i_icun.i_ic.ic_atime 

#define i_mtime i_icun.i_ic.ic_mtime 

#define i_ctime i_icun.i_ic.ic_ctime 

#define i_symlink i_icun.i_ic.ic_un2.ic_symlink 

#define i_flags i_icun.i_ic.ic_flags 

#define i_blocks i_icun.i_ic.ic_blocks 

/* Define 1) new name for real device number 2) name for device site # */ 

#define i_device i_icun.i_ic.ic_un2.ic_reg.ic_db[0] 

#define i_rsite i_icun.i_ic.ic_un2.ic_reg.ic_db[2] 

#define i_gen i_icun.i_ic.ic_gen 

#define i_lastr i_un.if_lastr 

#define i_socket i_un.is_socket 

#define i_forw i_chain [0] 

#define i_back i_chain[1]
#define i_freef i_fr.if_freef
#define i_freeb i_fr.if_freeb
#define i_frptr i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_frptr
#define i_fwptr i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fwptr
#define i_frcnt i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_frcnt
#define i_fwcnt i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fwcnt
#define i_fflag i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fflag
#define i_fifosize i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_fifo.if_fifosize
#define i_fifo i_icun.i_ic.ic_un2.ic_reg.ic_un.ic_fifo
#define i_fversion i_icun.i_ic.ic_fversion
#define i_contin i_icun.i_ic.ic_contin
#define i_acl i_icun.i_icc.icc_acl

/*
 * Only include ino.h if we are defining _KERNEL. No need otherwise.
 */
#ifdef _KERNEL
#include <sys/ino.h>
#endif

#ifdef _KERNEL
#ifdef __hp9000s800
extern struct inode *inode; /* the inode table itself */
extern struct inode *inodeNINODE; /* the end of the inode table */
extern int ninode; /* number of slots in the table */
extern struct vnodeops ufs_vnodeops; /* vnode operations for ufs */
extern struct vnodeops dux_vnodeops; /* vnode operations for dux */
extern struct vnode *rootdir; /* pointer to inode of root directory */
extern struct locklist locklist[]; /* The lock table itself */
#endif /* __hp9000s800 */

#ifdef __hp9000s300
struct inode *inode; /* the inode table itself */
struct inode *inodeNINODE; /* the end of the inode table */
int ninode; /* number of slots in the table */
extern struct vnodeops ufs_vnodeops; /* vnode operations for ufs */
extern struct vnodeops dux_vnodeops; /* vnode operations for dux */
struct vnode *rootdir; /* pointer to inode of root directory */
struct locklist locklist[]; /* The lock table itself */
#endif /* __hp9000s300 */

struct inode *ialloc();
struct inode *iget();
struct inode *ifind();
struct inode *owner();
struct inode *maknode();
struct inode *namei();
ino_t dirpref();
#endif /* _KERNEL */


/* flags */
#define ILOCKED 0x1 /* inode is locked */
#define IUPD 0x2 /* file has been modified */
#define IACC 0x4 /* inode access time to be updated */
#ifdef notdef
#define IMOUNT 0x8 /* inode is mounted on */
#endif
#define IWANT 0x10 /* some process waiting on lock */
#define ITEXT 0x20 /* inode is pure text prototype */
#define ICHG 0x40 /* inode has been changed */
#ifdef notdef
#define ISHLOCK 0x80 /* file has shared lock */
#define IEXLOCK 0x100 /* file has exclusive lock */
#endif
#define ILWAIT 0x200 /* someone waiting on file lock */
#define IREF 0x400 /* inode is being referenced */
/* change is use DUX !!! */
#define ILBUSY 0x800 /* lock is not available */
#define IRENAME 0x1000 /* this inode is the source of a
                          rename operation */
#define IACLEXISTS 0x2000 /* An acl exists for this inode */

#define ISYNCLOCKED 0x10000 /* inode locked for synchronization */
#define ISYNC 0x20000 /* synchronous I/O required */
#define IDUXMNT 0x40000 /* inode mounted remotely */
#define ISYNCWANT 0x80000 /* a process waiting on ISYNCLOCKED */
#define IDUXMRT 0x100000 /* root inode of remotely mounted d... */
#define IBUFVALID 0x200000 /* incore buffers presumed valid */
#define IPAGEVALID 0x400000 /* incore exec pages presumed valid */
#define IOPEN 0x800000 /* inode is currently being opened */

#define IFRAG 0x01000000 /* fragment was allocated, must ref... */

#define IHARD 0x2000000 /* hardened inode */
#define INOFLUSH 0x4000000 /* for iflush */

#if defined(__hp9000s800) && !defined(_WSIO)
#define IF_MI_DEV 0x08000000 /* dev_t has mgr_index already */
#else /* __hp9000s800 */
/* S200 doesn’t have mgr_index */
#define IF_MI_DEV 0x00000000
#endif /* __hp9000s800 */
#define IFRAGSYNC 0x10000000 /* need synch. frag_fit() */

/* modes */
#define IFMT 0170000 /* type of file */
#define IFIFO 0010000 /* fifo */
#define IFCHR 0020000 /* character special */
#define IFDIR 0040000 /* directory */
#define IFBLK 0060000 /* block special */
#define IFCONT 0070000 /* continuation inode */
#define IFREG 0100000 /* regular */
#define IFNWK 0110000 /* network special */
#define IFLNK 0120000 /* symbolic link */
#define IFSOCK 0140000 /* socket */

#define ISUID 04000 /* set user id on execution */
#define ISGID 02000 /* set group id on execution */
#define IENFMT 02000 /* enforced file locking */
#define ISVTX 01000 /* save swapped text even after use */
#define IREAD 0400 /* read, write, execute permissions */
#define IWRITE 0200
#define IEXEC 0100

#define IFIR 01 /* fifo read waiting for write flag */
#define IFIW 02 /* fifo write waiting for read flag */
#define PIPSIZ 8192 /* fifo buffer size */
#define FSEL_COLL 01 /* select collision flag */


/* for ILOCK and related macros - PA */

#define DUX_ILOCK(ip) (ip)->i_ilocksite = u.u_site

#define NFS_ILOCK(ip) (ip)->i_pid = u.u_procp->p_pid

#ifdef QFS
#define QFS_ILOCK(ip) record_lock((int) ip)
#define QFS_IUNLOCK(ip) remove_lock((int) ip)
#else /* not QFS */
#define QFS_ILOCK(ip)
#define QFS_IUNLOCK(ip)
#endif /* not QFS */

#define ILOCK(ip) { \
    QFS_ILOCK(ip); \
    while ((ip)->i_flag & ILOCKED) { \
        (ip)->i_flag |= IWANT; \
        sleep((caddr_t)(ip), PINOD); \
    } \
    (ip)->i_flag |= ILOCKED; \
    DUX_ILOCK(ip); \
    NFS_ILOCK(ip); \
}

#define IUNLOCK(ip) { \
    (ip)->i_flag &= ~ILOCKED; \
    QFS_IUNLOCK(ip); \
    if ((ip)->i_flag&IWANT) { \
        (ip)->i_flag &= ~IWANT; \
        wakeup((caddr_t)(ip)); \
    } \
}


#ifdef _KERNEL
/*
 * Convert between inode pointers and vnode pointers
 */
#define VTOI(VP) ((struct inode *)(VP)->v_data)
#define ITOV(IP) ((struct vnode *)&(IP)->i_vnode)

/*
 * Convert between vnode types and inode formats
 */
extern enum vtype iftovt_tab[];
extern int vttoif_tab[];
#define IFTOVT(M) ((((M)&IFMT) == IFNWK)?VFNWK:((((M)&IFMT) == IFIFO)
#define VTTOIF(T) (vttoif_tab[(int)(T)])

#define MAKEIMODE(T, M) (VTTOIF(T) | (M))


#define ESAME (-1) /* trying to rename linked files (special) */
#ifdef __hp9000s300
#define EREMOVE (-2) /* "source" file of link removed in the
                        middle of operation (happens only
                        originate from client) */
#endif /* __hp9000s300 */
#ifdef __hp9000s800
#define EREMOVE (-2) /* "source" file of link removed in the
                        middle of operation (happens only
                        originate from client) */
#endif /* __hp9000s800 */
#define ERENAME (-3) /* the inode being rename’d is in the path
                        of another rename operation */
#define EPATHCONF_NONAME (-4) /* The posix standard says that if a user
                                 requests an unknown name, it should not
                                 change errno but should return an error.
                                 This indicates that is the case. */


/*
 * Check that file is owned by current user or user is su.
 */
/* We can’t do a straight comparison of (CR)->cr_uid against (IP)->i_uid.
 * We also need to check the case where we are NFS, and network root (-2)
 * and the inode is owned by "nobody" because i_uid is an ushort and -2 is
 * stored as 65534.
 */
/* name conflict with DIL */
#define OWNER_CR(CR, IP) \
    (((CR)->cr_uid == (IP)->i_uid)? 0: \
    ((((CR)->cr_uid == -2) && ((IP)->i_uid == (ushort)-2))? 0: \
    (suser()? 0: u.u_error)))


enum de_op { DE_CREATE, DE_LINK, DE_RENAME }; /* direnter ops */ 


#endif /* _KERNEL */
/*
 * This overlays the fid structure (see vfs.h). Used mainly in support
 * of NFS 3.2 file handles, the fid structure should contain the minimum
 * information necessary to uniquely identify a file, GIVEN a pointer to
 * the file system.
 */
struct ufid { 
u_short ufid_len; 
ino_t ufid_ino; 
long ufid_gen; 
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/* 

 * The vnode is the focus of all file activity in UNIX.
 * There is a unique vnode allocated for each active file,
 * each current directory, each mounted-on file, text file, and the root.
 */

/*
 * vnode types. VNON means no type.
 */
enum vtype { VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK, VBAD, VFIFO, VFNW

enum vfstype { VDUMMY, VNFS, VUFS, VDUX, VDUX_PV, VDEV_VN, VNFS_SPEC,
               VNFS_BDEV, VNFS_FIFO, VCDFS, VDUX_CDFS, VDUX_CDFS_PV };


struct vnode {
    u_short v_flag; /* vnode flags (see below) */
    u_short v_shlockc; /* count of shared locks */
    u_short v_exlockc; /* count of exclusive locks */
    u_short v_tcount; /* private data for fs */
    int v_count; /* reference count */
    struct vfs *v_vfsmountedhere; /* ptr to vfs mounted here */
    struct vnodeops *v_op; /* vnode operations */
    struct socket *v_socket; /* unix ipc */
    struct vfs *v_vfsp; /* ptr to vfs we are in */
    enum vtype v_type; /* vnode type */
    dev_t v_rdev; /* device (VCHR, VBLK) */
    caddr_t v_data; /* private data for fs */
    enum vfstype v_fstype; /* file system type */
    struct vas *v_vas; /* vm data structures */
    vm_sema_t v_lock; /* vnode lock */
    struct buf *v_ord_lastdatalink; /* for ordered writes */
    struct buf *v_ord_lastmetalink; /* for ordered writes */
    struct buf *v_cleanblkhd; /* clean buffer head */
    struct buf *v_dirtyblkhd; /* dirty buffer head */
};
/*
 * vnode flags.
 */
#define VROOT 0x01 /* root of its file system */
#define VTEXT 0x02 /* vnode is a pure text prototype */
#define VEXLOCK 0x10 /* exclusive lock */
#define VSHLOCK 0x20 /* shared lock */
#define VLWAIT 0x40 /* proc is waiting on shared or excl. lock */
#define VMMF 0x100 /* Vnode memory mapped */
/*
 * Operations on vnodes.
 */
struct vnodeops {
    int (*vn_open)(__farg);
    int (*vn_close)(__farg);
    int (*vn_rdwr)(__farg);
    int (*vn_ioctl)(__farg);
    int (*vn_select)(__farg);
    int (*vn_getattr)(__farg);
    int (*vn_setattr)(__farg);
    int (*vn_access)(__farg);
    int (*vn_lookup)(__farg);
    int (*vn_create)(__farg);
    int (*vn_remove)(__farg);
    int (*vn_link)(__farg);
    int (*vn_rename)(__farg);
    int (*vn_mkdir)(__farg);
    int (*vn_rmdir)(__farg);
    int (*vn_readdir)(__farg);
    int (*vn_symlink)(__farg);
    int (*vn_readlink)(__farg);
    int (*vn_fsync)(__farg);
    int (*vn_inactive)(__farg);
    int (*vn_bmap)(__farg);
    int (*vn_strategy)(__farg);
    int (*vn_bread)(__farg);
    int (*vn_brelse)(__farg);
    int (*vn_pathsend)(__farg);
    int (*vn_setacl)(__farg);
    int (*vn_getacl)(__farg);
    int (*vn_pathconf)(__farg);
    int (*vn_fpathconf)(__farg);
    /*
     * Add VOPs for support NFS 3.2 file locking.
     */
    int (*vn_lockctl)(__farg);
    int (*vn_lockf)(__farg);
    int (*vn_fid)(__farg);
    int (*vn_fsctl)(__farg);
    int (*vn_prefill)(__farg);
    int (*vn_pagein)(__farg);
    int (*vn_pageout)(__farg);
    int (*vn_dbddup)(__farg);
    int (*vn_dbddealloc)(__farg);
};


#ifdef _KERNEL

#define VOP_OPEN(VPP,F,C) (*(*(VPP))->v_op->vn_open)(VPP,F,C)
#define VOP_CLOSE(VP,F,C) (*(VP)->v_op->vn_close)(VP,F,C)
#define VOP_RDWR(VP,UIOP,RW,F,C) (*(VP)->v_op->vn_rdwr)(VP,UIOP,RW,F,C)
#define VOP_IOCTL(VP,C,D,F,CR) (*(VP)->v_op->vn_ioctl)(VP,C,D,F,CR)
#define VOP_SELECT(VP,W,C) (*(VP)->v_op->vn_select)(VP,W,C)

/* An additional parameter specifying synchronization has been added to getattr */
#define VOP_GETATTR(VP,VA,C,S) (*(VP)->v_op->vn_getattr)(VP,VA,C,S)
#define VOP_SETATTR(VP,VA,C,N) (*(VP)->v_op->vn_setattr)(VP,VA,C,N)
#define VOP_ACCESS(VP,M,C) (*(VP)->v_op->vn_access)(VP,M,C)
#define VOP_LOOKUP(VP,NM,VPP,C,MVP) (*(VP)->v_op->vn_lookup)(VP,NM,VPP,C,MVP)
#define VOP_CREATE(VP,NM,VA,E,M,VPP,C) (*(VP)->v_op->vn_create) \
    (VP,NM,VA,E,M,VPP,C)

#define VOP_REMOVE(VP,NM,C) (*(VP)->v_op->vn_remove)(VP,NM,C)
#define VOP_LINK(VP,TDVP,TNM,C) (*(VP)->v_op->vn_link)(VP,TDVP,TNM,C)


12:16 1993 edited 9.0 vnode.h Page 3 


#define VOP_RENAME(VP,NM,TDVP,TNM,C) (*(VP)->v_op->vn_rename) \
    (VP,NM,TDVP,TNM,C)
#define VOP_MKDIR(VP,NM,VA,VPP,C) (*(VP)->v_op->vn_mkdir)(VP,NM,VA,VPP,C)
#define VOP_RMDIR(VP,NM,C) (*(VP)->v_op->vn_rmdir)(VP,NM,C)
#define VOP_READDIR(VP,UIOP,C) (*(VP)->v_op->vn_readdir)(VP,UIOP,C)
#define VOP_SYMLINK(VP,LNM,VA,TNM,C) (*(VP)->v_op->vn_symlink) \
    (VP,LNM,VA,TNM,C)
#define VOP_READLINK(VP,UIOP,C) (*(VP)->v_op->vn_readlink)(VP,UIOP,C)
#define VOP_FSYNC(VP,C,S) (*(VP)->v_op->vn_fsync)(VP,C,S)
#define VOP_INACTIVE(VP,C) (*(VP)->v_op->vn_inactive)(VP,C)
#define VOP_BMAP(VP,BN,VPP,BNP) (*(VP)->v_op->vn_bmap)(VP,BN,VPP,BNP)
#define VOP_STRATEGY(BP) (*(BP)->b_vp->v_op->vn_strategy)(BP)
#define VOP_BREAD(VP,BN,BPP) (*(VP)->v_op->vn_bread)(VP,BN,BPP)
#define VOP_BRELSE(VP,BP) (*(VP)->v_op->vn_brelse)(VP,BP)

#define VOP_PATHSEND(VPP,PNP,FOLLOW,NLINKP,DIRVPP,COMPVPP,OPCODE,DEPENDENT) \
    ((*(*(VPP))->v_op->vn_pathsend) ? \
    (*(*(VPP))->v_op->vn_pathsend) \
    (VPP,PNP,FOLLOW,NLINKP,DIRVPP,COMPVPP,OPCODE,DEPENDENT) : \
    (panic("VOP_PATHSEND"), EINVAL))
#define VOP_SETACL(VP,NT,BP) (*(VP)->v_op->vn_setacl)(VP,NT,BP)
#define VOP_GETACL(VP,NT,BP) (*(VP)->v_op->vn_getacl)(VP,NT,BP)
#define VOP_PATHCONF(VP,NT,BP,CR) (*(VP)->v_op->vn_pathconf)(VP,NT,BP,CR)
#define VOP_FPATHCONF(VP,NT,BP,CR) (*(VP)->v_op->vn_fpathconf)(VP,NT,BP,CR)

/*
 * VOPs for NFS 3.2 file locking. Ours are different because we support
 * local file locking already in the kernel. VOP_LOCKCTL() is called from
 * fcntl() to process a lock request. We have extra parameters because
 * the lower level routines will need the file structure for the file
 * being locked. The Lower Bound and Upper Bound are passed in because the
 * higher level routine already computed them for error checking. This means
 * that ALL functions calling these routines MUST include reasonable values
 * for LB and UB. Also, Sun does not have a VOP_LOCKF() because they
 * emulate lockf() as a library on top of fcntl(), instead of two separate
 * system calls like ours.
 */
#define VOP_LOCKCTL(VP,LD,CMD,C,FP,LB,UB) (* (VP) ->v_op->vn_lockctl) \ 
(VP,LD,CMD,C,FP, LB, UB) 
#define VOP_LOCKF (VP,CMD,SIZE,C,FP,LB, UB) (* (VP) ->v_op->vn_lockf) \ 
(VP,CMD,SIZE,C,FP, LB, UB) 

/*
 * Support for NFS 3.2 file handles. Given a vnode pointer, generate
 * a "file id" which can be used to recreate the vnode later on.
 */
#define VOP_FID(VP, FIDPP) (* (VP) ->v_op->vn_fid) (VP, FIDPP) 
#define VOP_FSCTL(VP, COMMAND, UIOP, CRED) (*(VP)->v_op->vn_fsctl) \ 
(VP, COMMAND, UIOP, CRED) 
#define VOP_PREFILL(VP, PRP) (* (VP) ->v_op->vn_prefill) (PRP) 
#define VOP_DBDDUP (VP, DBD) (* (VP) ->v_op->vn_dbddup) (VP, DBD) 
#define VOP_DBDDEALLOC(VP,DBD) \ 
(( (VP) ->v_op->vn_dbddealloc) ? (* (VP) ->v_op->vn_dbddealloc) (VP, DBD) :1) 
#define VOP_PAGEOUT (VP, PRP,START,END,FLAGS) \ 
(* (VP) ->v_op->vn_pageout) (PRP, START, END, FLAGS) 


#define VOP_PAGEIN (VP, PRP, WRT, SPACE, VADDR, START) \ 
(* (VP) ->v_op->vn_pagein) (PRP, WRT, SPACE, VADDR, START) 



/*
 * flags for above
 */
#define IO_UNIT 0x01 /* do io as atomic unit for VOP_RDWR */
#define IO_APPEND 0x02 /* append write for VOP_RDWR */
#define IO_SYNC 0x04 /* sync io for VOP_RDWR */

#endif /* _KERNEL */

/*
 * Vnode attributes. A field value of -1
 * represents a field whose value is unavailable
 * (getattr) or which is not to be changed (setattr).
 */
/*DUX MESSAGE STRUCTURE*/
struct vattr {
    enum vtype va_type; /* vnode type (for create) */
    u_short va_mode; /* files access mode and type */
    u_short va_uid; /* owner user id */
    u_short va_gid; /* owner group id */
    /*moved va_nlink for alignment*/
    short va_nlink; /* number of references to file */
    long va_fsid; /* file system id (dev for now) */
    long va_nodeid; /* node id */
    u_long va_size; /* file size in bytes (quad?) */
    long va_blocksize; /* blocksize preferred for i/o */
    struct timeval va_atime; /* time of last access */
    struct timeval va_mtime; /* time of last modification */
    struct timeval va_ctime; /* time file "created" */
    dev_t va_rdev; /* device the file represents */
    long va_blocks; /* kbytes of disk space held by file */
    site_t va_rsite; /* site the device file represents */
    site_t va_fssite; /* file system site (dev site) */
    dev_t va_realdev; /* The real device number of device
                         containing the inode for this file */
    u_short va_basemode; /* the base mode bits unaltered */
    u_short va_acl:1, /* set if optional ACL entries */
            va_fstype:3,
            :12;
};

/*
 * Modes. Some values same as Ixxx entries from inode.h for now
 */
#define VSUID 04000 /* set user id on execution */
#define VSGID 02000 /* set group id on execution */
#define VSVTX 01000 /* save swapped text even after use */
#define VREAD 0400 /* read, write, execute permissions */
#define VWRITE 0200
#define VEXEC 0100
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J. Wakerly 
26 February 1982 


The following description has appeared in a number of informal publications
of computer users, and has been variously attributed to Jeff Berryman, Bruce
VanAtta, and probably others as well. I’m not sure who the original author
is, but read, understand, and enjoy.


The Paging Game -- Rules 
Each player gets several million things. 


Things are kept in crates that hold 4096 things each. Things in the 
same crate are called crate-mates. 


Crates are stored either in the workshop or the warehouse. The workshop 
is almost always too small to hold all the crates. 


There is only one workshop but there may be several warehouses. 
Everyone shares them. 


Each thing has its own thing number. 

What you do with a thing is to zark it. Everyone takes turns zarking. 
You can only zark your things, not anyone else’s. 

Things can only be zarked when they are in the workshop. 


Only the Thing King knows whether a thing is in the workshop or in a
warehouse.


The longer a thing goes without being zarked, the grubbier it is said 
to become. 


The way you get things is to ask the Thing King. He only gives out 
things in multiples of eight. This is to keep the royal overhead down. 


The way you zark a thing is to give its thing number. If you give the 
number of a thing that happens to be in a workshop it gets zarked 
right away. If it is in a warehouse, the Thing King packs the crate 
containing your thing back into the workshop. If there is no room in 
the workshop, he first finds the grubbiest crate in the workshop, 
whether it be yours or somebody else’s, and packs it off with all its 
crate-mates to a warehouse. In its place he puts the crate containing 
your thing. Your thing then gets zarked and you never know that it 
wasn’t in the workshop all along. 


Each player’s stock of things have the same numbers as everybody 
else’s. The Thing King always knows who owns what thing and whose 
turn it is, so you can’t ever accidentally zark somebody else’s thing
even if it has the same thing number as one of yours. 


Notes 


Traditionally, the Thing King sits at a large, segmented table and is 
attended to by pages (the so-called "table pages") whose job it is to 
help the king remember where all the things are and who they belong to. 


Rules 9 and 12 free players to concentrate on zarking their things, 
letting the King do the worrying about where the things are located. 


One consequence of Rule 13 is that everybody’s thing numbers will be
similar from game to game, regardless of the number of players. 


The Thing King has a few things of his own, some of which move back and 
forth between workshop and warehouse just like anybody else’s, but some 
of which are just too heavy to move out of the workshop. 


With the given set of rules, oft-zarked things tend to get kept mostly 
in the workshop, while little-zarked things stay mostly in a warehouse. 
This is efficient stock control. 


Sometimes even warehouses get full. The Thing King then has to 

start piling things on the dump out back. This makes the game slower 
because it takes a long time to get things off of the dump when they 
are needed in the workshop. A forthcoming change in the rules will 
allow the Thing King to select the grubbiest things in the warehouses 
and send them to the dump in his spare time, thus keeping the 
warehouses from getting too full. This means that the most 
infrequently-zarked things will end up in the dump so the Thing King 
won’t have to get things from the dump so often. This should speed up 
the game when there are a lot of players and the warehouses are
getting full. 


Every player is a winner in the paging game despite the apparent 
autocratic nature of the King. 


LONG LIVE THE THING KING! 


Virtual Memory 


Why? 


How? 
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Memory Management 


- allow for (fairly) efficient stretching of memory

- allow all programs to think they are running by themselves
  by providing virtual address space for each process

There will always be swap space reserved for a process’
memory; it may or may not have enough physical RAM for
all it is doing.

Pageout daemon kicks out pages if we’re running short and
they aren’t being referenced often enough; swapper kicks
out whole processes if we’re *really* getting short.

Virtual address translation

- 68K

  32 bit address
    10 bits tell which segment table entry
    10 more tell which page table entry (pte)
    12 bits for offset into 4k page

  pte has 20 bit physical address (of 4k page) and
  has 12 bits left over for protection information,
  flags, etc.

  68040 requires 3-level tables, but the idea is
  the same.
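
The 10/10/12 split described above can be sketched in a few lines of C. This is only an illustration of the arithmetic, not the HP-UX code: the names `split_va`, `va_parts`, and `pte_to_phys` are made up for this example, and the pte layout (page address in the high 20 bits, flags in the low 12) is an assumption taken from the slide text.

```c
#include <stdint.h>

/* Field widths from the slide: 10-bit segment index, 10-bit page
 * index, 12-bit offset into a 4k page. */
#define SEG_BITS  10
#define PAGE_BITS 10
#define OFF_BITS  12

/* Hypothetical container for the three pieces of a virtual address. */
struct va_parts {
    uint32_t seg;    /* which segment table entry */
    uint32_t page;   /* which page table entry (pte) */
    uint32_t offset; /* byte offset within the 4k page */
};

static struct va_parts split_va(uint32_t va)
{
    struct va_parts p;
    p.seg    = va >> (PAGE_BITS + OFF_BITS);
    p.page   = (va >> OFF_BITS) & ((1u << PAGE_BITS) - 1);
    p.offset = va & ((1u << OFF_BITS) - 1);
    return p;
}

/* Assumed pte layout: high 20 bits are the physical page address,
 * low 12 bits are protection bits and flags, which are masked off
 * before the offset is merged in. */
static uint32_t pte_to_phys(uint32_t pte, uint32_t offset)
{
    return (pte & ~((1u << OFF_BITS) - 1)) | offset;
}
```

For example, virtual address 0x00403abc selects segment table entry 1, page table entry 3, and offset 0xabc within the page.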


system shares *large* virtual address space; 
each process gets 4 1GB chunks of it; 


when there is a TLB miss, the system will use 
the PDIR (reverse page table) to resolve the 
address 
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Foundation Principles 


Lots of things will be shared; the VM system should 
encourage this by making it efficient: 


- copy-on-write - allows for efficient fork(), etc 


- shared libraries; allow sharing of text at granularity 
of library rather than a.out 


A process address space is nothing more than a bunch of 
collections of pages (abstracted as pregions/regions). 


Machine independence: 


- the bulk of the VM system is shared between 300/400 
and 700/800 - the Hardware-Independent Layer ("HIL"). 


- the parts specific to one or the other are well 
compartmentalized and there are clean interfaces 
to this code - the Hardware-Dependent Layer ("HDL"). 


The bulk of the system should deal in pages, but shouldn’t 
know much about them - all the HIL knows is that pages are 
NBPG bytes in size and it can get at them via pfdat[]. 




Regions 


Pregions 
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Regions are the building blocks for the whole VM system. 


A region is a logically contiguous set of pages that are 
used for *one* thing such as stack, text, shared lib, etc. 


Regions contain (among other things) 
- the type of this region (unused, private, or shared) 
- the number of pages in this region 
- the number of physical pages in this region 


- "disk block descriptors" - tell where the data can be 
paged/swapped to; one for each page in the region 


- a vnode * that tells which device/filesystem the data
  in this region came/comes from

- a vnode * that tells which device/filesystem the data
  in this region goes to

(The vnodes tell *which* device/filesystem; the DBDs
tell *where* on that device/filesystem.)



A pregion can be thought of as a connection between a region 
and a process. 


Note that in the region data structure there is no place for 
things like the virtual address at which the region is mapped; 
this is because regions are system-wide structures, and that 
sort of information is per-process. To connect regions to 
processes, we use structures called pregions. Some of the 
more important fields in a pregion:


- pointers to the pregions on either side 
- a pointer back to the vas 


- the type of this pregion (text, data, stack, mapped file, 
I/O, shared memory, etc) 


- the virtual address (in the process’ address space) this 
pregion is mapped to 


- a count of the number of pages this pregion is mapping

- a pointer to the region
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Per-process VM Structures 


A process’ memory map is represented by something called a 
"vas" (virtual address space), which is little more than a 
doubly-linked list of pregions. A typical process will have 
4 "normal" pregions as well as some extra ones.... 


                                                         4GB
    pregion 9 - u area           system overhead
    pregion 8 - stack            user stack
                                 -----------  <-- top of stack
                                      |
                                      \/

    pregions 3-7                 ---> shared libs
                                 ---> (mapped files)
                                 --->

                                      /\
                                      |
                                 -----------  <-- top of data segment
    pregion 2 - data             user bss/data
    pregion 1 - text             text
                                                         0


We said above that a process’ address space was represented by a "vas". 
For any process, there is a pointer in its "proc structure" that points 
to its vas. The vas has several things in it, most notably 


- a pair of pointers to a doubly linked list of pregions, 
sorted by where they are in the process’ address space 


- a pointer to hardware-dependent structures (such as the 
segment/page tables for 680x0) 


Note the hierarchy: each process has its own vas, which gets us 
to the pregions, which point at (system-wide) region structures. 
All of this is for the kernel; the MMU still uses segment and page 
tables to do (virtual ---> physical) translation. 
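

The proc -> vas -> pregion -> region hierarchy can be sketched in C. This is only an illustration: the structures are cut down to the fields discussed above, the list is NULL-terminated for simplicity, and PG_BYTES and find_pregion() are made-up names, not kernel symbols. 

```c
#include <assert.h>
#include <stddef.h>

#define PG_BYTES 4096ul          /* assumption: 4K pages, as on the 68K */

struct region {                  /* system-wide: knows nothing about addresses */
    int    r_type;
    size_t r_pgsz;
};

struct pregion {                 /* per-process: ties a region to an address */
    struct pregion *p_next, *p_prev;
    int             p_type;
    unsigned long   p_vaddr;     /* where the region is mapped */
    size_t          p_count;     /* number of pages mapped */
    struct region  *p_reg;
};

struct vas {                     /* head of the address-sorted pregion list */
    struct pregion *va_next, *va_prev;
};

/* Return the pregion that maps vaddr, or NULL if nothing is mapped there. */
struct pregion *find_pregion(const struct vas *vas, unsigned long vaddr)
{
    struct pregion *prp;

    assert(vas != NULL);
    for (prp = vas->va_next; prp != NULL; prp = prp->p_next) {
        unsigned long end = prp->p_vaddr + prp->p_count * PG_BYTES;

        if (vaddr < prp->p_vaddr)
            break;               /* list is sorted, so we have gone past it */
        if (vaddr < end)
            return prp;
    }
    return NULL;
}
```

Once the pregion is found, prp->p_reg leads to the shared region; the virtual address itself never appears in the region. 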


When To Do What 


Available Memory 


    If the amount of free physical memory stays up here, life is 
    wonderful. If it falls down here, though, we’re in trouble... 


                min(512K, 25% of user memory) 
lotsfree +--------------------------------------------------+ 
                pageout daemon runs below here; 
                scans pages and may page a few out 


                min(200K, 12.5% of user memory) 
desfree  +--------------------------------------------------+ 
                Swapper will run below here, and 
                vhand will try harder 


                min(64K, desfree/2) 
minfree  +--------------------------------------------------+ 
                Swapper will force active processes 
                out below here 

Note: these numbers may change from release to release; the 
general idea is likely to be around for a while. 
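

As a rough illustration of the three thresholds, the sketch below computes them from the formulas in the diagram and classifies a free-memory figure. It works in bytes for readability (the kernel keeps these values in pages), and every name in it is invented for the example. 

```c
#include <assert.h>

struct vm_thresholds {
    unsigned long lotsfree, desfree, minfree;   /* all in bytes here */
};

enum vm_pressure { VM_OK, VM_PAGING, VM_SWAPPING, VM_HARD_SWAPPING };

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Thresholds as given in the diagram above. */
struct vm_thresholds compute_thresholds(unsigned long user_mem_bytes)
{
    struct vm_thresholds t;

    t.lotsfree = min_ul(512ul * 1024, user_mem_bytes / 4);   /* 25%   */
    t.desfree  = min_ul(200ul * 1024, user_mem_bytes / 8);   /* 12.5% */
    t.minfree  = min_ul(64ul * 1024, t.desfree / 2);
    return t;
}

/* Which regime are we in for a given amount of free memory? */
enum vm_pressure classify(const struct vm_thresholds *t, unsigned long free_bytes)
{
    assert(t != 0);
    if (free_bytes < t->minfree)
        return VM_HARD_SWAPPING;   /* swapper forces active processes out */
    if (free_bytes < t->desfree)
        return VM_SWAPPING;        /* swapper runs, vhand tries harder */
    if (free_bytes < t->lotsfree)
        return VM_PAGING;          /* pageout daemon runs */
    return VM_OK;
}
```

On an 8MB system the fixed limits win (512K / 200K / 64K); on a very small system the percentages take over. 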
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The Paging Game 


- A (somewhat) graceful way of stretching the amount of 
available memory. 


- Implemented with a clock algorithm: 


- "age hand" goes around at a calculated rate, aging 
pages by clearing their reference bits 


- if the process accesses the page, the reference bit will 
be set again 


- if the "steal hand" comes around and the reference bit is 
still clear, the page is likely to get kicked out 


- if the process accesses a page that has been "kicked 
out" but hasn’t been given to someone else yet, a "soft" 
page fault occurs and the page can be reclaimed 


- the "hands" only look at active pregions; this way no 
time is wasted looking at physical pages that can’t be 
paged out (i.e. a driver grabs some memory; that memory 
can’t be paged, so there’s no reason for the kernel to 
look at it) 


- in 8.0 there is a severe problem with this scheme, because 
if we kick out 20 pages in a row, they probably all 
came from 1-2 pregions, and those were probably from 
1-2 processes :-( *** this is fixed in 9.0 *** 


- Speed of hands is calculated to keep overhead <= 10% of CPU time. 


- Pageout daemon is process 2; doesn’t run at all if more than 
"lotsfree" memory available. 


The pager views memory as if it was around the face of a clock. 
For our purposes, we’ll unroll the clock and look at it as a 
straight line (numbers above each pregion indicate its size): 


     50        140        30         40          95 
| X text  |  X data  | X stack | mwm text  |  mwm data ... 


In 9.0 the pager will look through 1/16 of each pregion’s pages 
at a time, so it will go around the whole "clock" 16 times to 
visit all of the eligible memory. 
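

A toy model of the two-handed clock may help. It is a sketch only: it treats memory as one flat array of pages rather than a chain of pregions, and struct pager_clock and clock_tick() are invented names. 

```c
#include <assert.h>
#include <stddef.h>

struct page {
    int referenced;     /* set (by the hardware) when the page is touched */
    int resident;       /* still in memory? */
};

struct pager_clock {
    struct page *pages;
    size_t       npages;
    size_t       age_hand;     /* clears reference bits */
    size_t       steal_hand;   /* follows behind, steals unreferenced pages */
};

/* Advance both hands by n pages; return how many pages were stolen. */
size_t clock_tick(struct pager_clock *c, size_t n)
{
    size_t stolen = 0, i;

    assert(c != NULL && c->npages > 0);
    for (i = 0; i < n; i++) {
        struct page *victim = &c->pages[c->steal_hand];

        /* Not referenced since the age hand went by: kick it out. */
        if (victim->resident && !victim->referenced) {
            victim->resident = 0;
            stolen++;
        }
        /* "Age" the page under the age hand. */
        c->pages[c->age_hand].referenced = 0;

        c->age_hand   = (c->age_hand + 1) % c->npages;
        c->steal_hand = (c->steal_hand + 1) % c->npages;
    }
    return stolen;
}
```

The distance between the hands is the grace period a process gets to touch a page again before it is stolen. 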
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The Pageout Daemon ("vhand"; process 2) 


loop: 

    pages_to_scan = maxmem/scanrate/tune.t_vhandr                  #1 

    if we have plenty of RAM                                       #2 
        pages_to_free = 0 
    else 
        pages_to_free = desfree - freemem 

    if pages_to_free > 0 
        do 
            look for pageable pregion; if found                    #3 
                get pageout routine from appropriate 
                filesystem to steal the pages; normally 
                this will be the "devswap" filesystem 
        while we haven’t yet stolen pages_to_free pages 

    while pages_to_scan > 0 
        find an "ageable" pregion (one that’s not locked right now) 
        clear ref bits for its pages, starting where we left off 
        last time and dropping pages_to_scan appropriately 

    goto loop 


Notes 


"maxmem" is basically the number of pages the kernel didn’t take 
at boot time; "scanrate" is the number of seconds it should take 
to go around the clock, assuming that vhand shouldn’t take too 
much of the system’s time and that it should run faster/slower 
depending on how much memory is currently free; "tune.t_vhandr" 
tells how many times per second to run vhand - it is part of a 
larger structure that controls the pager’s operations 


The fact that the pager is running means the system is short of 
memory; how short it is will govern whether we actually steal 
pages or not 


The pager doesn’t want to know about devices, so it hides behind 
the vnode layer; when it wants to page out some of a pregion’s 
pages, it calls the filesystem associated with the region; this 
would normally be the pseudo-filesystem "devswap" (which only has 
pagein/pageout routines) 
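

The two quantities computed at the top of the loop can be written out as below. This is a sketch: the function names are invented, and "plenty of RAM" is taken to mean "free memory is at or above desfree", which is an assumption rather than something the pseudocode states. 

```c
#include <assert.h>

/*
 * Pages to age on one run of vhand: go around all of maxmem in
 * scanrate seconds, running vhand_rate times per second.
 */
long pages_to_scan(long maxmem, long scanrate, long vhand_rate)
{
    assert(scanrate > 0 && vhand_rate > 0);
    return maxmem / scanrate / vhand_rate;
}

/* Pages to steal on this run: the shortfall below desfree, if any. */
long pages_to_free(long freemem, long desfree)
{
    long shortfall = desfree - freemem;

    return shortfall > 0 ? shortfall : 0;
}
```

For example, with 4096 pages of maxmem, a 16 second scan rate and vhand running 8 times a second, each run ages 32 pages. 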




Swapping 


A cumbersome way of stretching the amount of available memory. 


Can consume lots of the system’s resources. 


Kick out whole process at a time, not just part of it. 


Only happens when we are really worried about the amount of 
memory available. 


If the swapper runs much at all, the system is underconfigured. 


The basic plan is to kick out junk; if that fixes the problem, 
we’re OK. Only as a last resort will an active process get 
swapped out. 


Deactivation (new in 9.0) 


move the process to a priority that the scheduler will 
ignore (keep it from running, period) 


let the pager steal its pages 


Swap out the u area & kernel stack, since the pager 
is not allowed to touch those 


motivation is to keep from overloading the system 
with swap traffic (pager is much nicer to system than 
swapper) 


Process 0: The Swapper 


loop: 
    if ((>= 2 runnable procs) and (very short of RAM)) 
        goto hardswap 

    walk through proc table, switching on p_stat { 

        case runnable but swapped out: 
            if this guy is the highest priority we’ve seen 
                remember him 

        case sleeping or stopped: 
            if this guy is dead in the water 
                kick him out 
    } 

    if nobody wants in 
        sleep until we’re needed 
        goto loop 

    if it’s not critical to bring someone in 
        wait awhile 
        goto loop 
    else 
        try to swap most important process in (usually works) 
        if it worked, goto loop 

hardswap: 
    walk through proc table { 

        if process isn’t swappable or is a zombie 
            skip it 

        if (proc. is stopped) or (has slept awhile at int’ible pri) 
            if it has slept longer than anyone we’ve seen 
                remember it 
        else if (don’t have sleeper yet) and (it’s runnable|asleep) 
            see how big it is 
            if it’s one of the biggest we’ve seen 
                remember it 
    } 

    if we didn’t find a long sleeper 
        pick "oldest" big job (based on nice value and time in-core) 

    if (found a sleeper) or (desperate and found *someone* to swap) or 
       (someone needs in and someone else has been in for awhile) 
        if we’re desperate 
            fake like we’re still short on memory 
        try to swap this guy out (will usually succeed) 
        goto loop 

    wait awhile and then goto loop 
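

The victim selection in "hardswap" can be approximated as follows. This is a deliberately simplified sketch: it ignores sleep priority, nice value and time in core, keeps only the single biggest job, and all names (struct proc_info, pick_swap_victim) are hypothetical. 

```c
#include <assert.h>
#include <stddef.h>

enum pstate { P_RUNNABLE, P_SLEEPING, P_STOPPED, P_ZOMBIE };

struct proc_info {
    enum pstate state;
    int         swappable;    /* 0 if the process may not be swapped */
    int         sleep_secs;   /* how long it has been asleep */
    size_t      size_pages;   /* how big it is */
};

/* Return the index of the process to swap out, or -1 if none qualifies. */
int pick_swap_victim(const struct proc_info *p, int n, int long_sleep_secs)
{
    int sleeper = -1, biggest = -1, i;

    assert(p != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (!p[i].swappable || p[i].state == P_ZOMBIE)
            continue;                       /* skip it */

        if (p[i].state == P_STOPPED ||
            (p[i].state == P_SLEEPING && p[i].sleep_secs >= long_sleep_secs)) {
            /* Prefer whoever has slept the longest. */
            if (sleeper < 0 || p[i].sleep_secs > p[sleeper].sleep_secs)
                sleeper = i;
        } else if (sleeper < 0) {
            /* No sleeper yet: remember the biggest job as a fallback. */
            if (biggest < 0 || p[i].size_pages > p[biggest].size_pages)
                biggest = i;
        }
    }
    return sleeper >= 0 ? sleeper : biggest;
}
```

The point of the ordering is the one made above: kick out junk (long sleepers) first, and only fall back to a big active job when there is no junk left. 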


Swap Space Allocation/Management 


- Space allocation - per region 


    - A page of swap is reserved for each page of the 
      region (assuming it is a data/stack sort of region). 


    - The number of pages of swap available to reserve 
      is in a kernel global variable called "swapspc_cnt"; 
      the maximum is in "swapspc_max". 


    - Space won’t be allocated until we need it; at that 
      point, an address (really indices into the 
      swaptab[]/swapmap[] below) will be put into the DBD 
      for the page. 


- Space allocation - shared objects 


    - Shared text can be released if no processes are 
      using it; note that it is not swapped; we just arrange 
      to fault it in when it is referenced again. 


    - Shared memory can be swapped out if no processes are 
      using it (implying that doing constant 
      shmat(2)/shmdt(2)s is a bad idea). 


- Space allocation - system-wide 


    - "swaptab" is an array with MAXSWAPCHUNKS entries, 
      each corresponding to 2 MB of swap space (default - 
      the parameter is named "swchunk" and it defaults to 
      2048 (1k units)). 


    - The major component of a swaptab[] entry is an array 
      called "swapmap" - it has an entry in it for each page 
      of space in this chunk. 


    - "swdevt" is an array, one element per disk that has 
      swap space on it. It is in /etc/conf/conf.c. 


    - If the swap space is spread over >1 disk, the 
      space is taken from equal-priority disks in a 
      round-robin fashion. Device swap is regarded as 
      a higher priority than filesystem swap, for a 
      given priority (e.g. device swap at priority 5 will 
      get used before fs swap at priority 5 which will get 
      used before device swap at 6). 


    - Filesystem swap is normally allocated from the 
      filesystem when it is needed and returned when not; 
      the exception is when the system manager specifies a 
      minimum amount to take (and keep). 


    - Note that we never guarantee contiguous chunks, but 
      will certainly accept them :-) 
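

The ordering rule for swap areas (lower priority number first; device swap ahead of filesystem swap at the same priority) can be expressed as a comparison function. This is a sketch with invented names; the kernel itself uses the swdev_pri[]/fswdev_pri[] arrays described later rather than a sort, and the round-robin step among equal entries is not shown. 

```c
#include <assert.h>
#include <stdlib.h>

enum swap_kind { SWAP_DEVICE = 0, SWAP_FS = 1 };

struct swap_area {
    int            id;         /* hypothetical identifier, for the example */
    int            priority;   /* lower number is used first */
    enum swap_kind kind;
};

/* Order by priority, then device swap before filesystem swap. */
static int swap_order(const void *a, const void *b)
{
    const struct swap_area *x = a, *y = b;

    if (x->priority != y->priority)
        return x->priority < y->priority ? -1 : 1;
    return (int)x->kind - (int)y->kind;
}

void sort_swap_areas(struct swap_area *areas, size_t n)
{
    assert(areas != NULL || n == 0);
    if (n > 1)
        qsort(areas, n, sizeof areas[0], swap_order);
}
```

With device swap at priority 5, fs swap at priority 5 and device swap at priority 6, the sorted order matches the example in the text. 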


Swap Space Allocation/Management 


[Figure: swdev_pri[] and fswdev_pri[] (indexed by priority) point 
at swdevt[] and fswdevt[] entries; those lead to their swaptab[] 
entries (2 MB each), and each swaptab entry points at a swapmap[]. 
The example shows device swap on 7937 (2 MB, pri 0), 7945 (6 MB, 
pri 1) and 7959 (4 MB, pri 0), plus filesystem swap on 7958 
(pri 0).] 


Each swaptab entry points at a swapmap[], which holds one 
structure for each page in the chunk’s worth of swap; each 
structure is a use count and a ptr to the next free entry. 


In a region, each page will have a VFD and a DBD. When a page 
has been pushed out to the swap area, the DBD will have an index 
into the swaptab[] and an index into that entry’s swapmap[]. 


Important Data Structures 


pfdat - used to keep track of physical memory. There’s an entry 
in it for each page of non-kernel memory. The structure is 
defined in sys/pfdat.h. 


68K: 
- Segment table - one for each process. Each table has 
1024 entries, each of which can point at a page table 
(or block table, if 3-level tables are being used). The 
structure for these tables is in machine/pte.h. 
- Page table - 1024 entries, each of which can point to a 
4K page of RAM. 


swdevt[] - an array of structures, one element per disk that 
has swap on it; the structure contains things like where the 
swap starts, how many blocks are there, etc. There is a 
similar structure called "fswdevt" for filesystem swap. 


swaptab[] - an array of structures, one for each 2MB (default) 
of swap space. It is sized by the kernel parameter MAXSWAPCHUNKS, 
and each entry points at a swapmap[]... 


swapmap[] - (not related to pre-8.0 swapmap!) - an array that 
hangs off of a swaptab[] entry; there is an entry in a swapmap[] 
for each page of swap space in the (by default) 2MB chunk. The 
entries consist of a use count and a pointer to the next free 
entry in the swapmap. 
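

The free list threaded through a swapmap[] can be sketched like this. The field names follow the swap.h listing later in these notes, but the rest is assumed for illustration: the swapmap is embedded in the swaptab entry instead of hanging off a pointer, -1 is used as the end-of-list marker, and the chunk_*() functions are invented. 

```c
#include <assert.h>

#define PAGES_PER_CHUNK 512     /* 2 MB chunk / 4 KB pages (assumed page size) */

struct swapmap {
    unsigned short sm_ucnt;     /* number of users of this page */
    short          sm_next;     /* index of next free entry, -1 at the end */
};

struct swaptab {
    short          st_free;     /* index of 1st free swapmap[] entry, or -1 */
    int            st_nfpgs;    /* number of free pages in this chunk */
    struct swapmap st_swpmp[PAGES_PER_CHUNK];
};

/* Put every page of the chunk on the free list. */
void chunk_init(struct swaptab *st)
{
    int i;

    assert(st != 0);
    for (i = 0; i < PAGES_PER_CHUNK; i++) {
        st->st_swpmp[i].sm_ucnt = 0;
        st->st_swpmp[i].sm_next = (short)(i + 1 < PAGES_PER_CHUNK ? i + 1 : -1);
    }
    st->st_free = 0;
    st->st_nfpgs = PAGES_PER_CHUNK;
}

/* Take one page off the free list; returns its swapmap index or -1. */
int chunk_alloc(struct swaptab *st)
{
    int idx = st->st_free;

    if (idx < 0)
        return -1;                          /* chunk is full */
    st->st_free = st->st_swpmp[idx].sm_next;
    st->st_swpmp[idx].sm_ucnt = 1;
    st->st_nfpgs--;
    return idx;
}

/* Drop one use of a page; the last user puts it back on the free list. */
void chunk_free(struct swaptab *st, int idx)
{
    assert(idx >= 0 && idx < PAGES_PER_CHUNK);
    assert(st->st_swpmp[idx].sm_ucnt > 0);
    if (--st->st_swpmp[idx].sm_ucnt == 0) {
        st->st_swpmp[idx].sm_next = st->st_free;
        st->st_free = (short)idx;
        st->st_nfpgs++;
    }
}
```

The index returned by chunk_alloc(), together with the index of the swaptab[] entry, is the pair that would be recorded in the page’s DBD. 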


swdev_pri[] - an array of prioritized pointers to swap disks; 
each disk that is at a particular swap priority has an entry 
in swdev_pri[that_priority] 


vmmeter and vmtotal - see the respective header files for these 
structures; they have important summary information that things 
like top and monitor display 


Tunable Parameters 

- maxdsiz, maxssiz, maxtsiz - maximum sizes of the respective 
parts of a process. There is no built-in "cost" for raising 
these parameters - they are here as sanity checks. 

- minswapchunks - minimum amount of swap for a diskless node. It 
is always allocated to the node (this applies to other systems 
as well, but is primarily an issue for diskless systems that 
get their swap from a server). 

- maxswapchunks - maximum amount of swap space a system is allowed 
to allocate; note that this is enforced on the node itself, 
not by the diskless server ==> each system has its own value 

- nswapdev - no. of entries in swdevt[]; if this number is more 
than the number of "swap..." lines in the dfile, there will 
be room for dynamic swapon(1M) commands after boot time. 

- swchunk - size of chunk in swaptab[] - defaults to 2048 
which means 2MB 

- unlockable_mem - amount of RAM that can not be locked 


Note that other parameters (such as nbuf) can have an effect on 
the VM system (what if nbuf was 1024 on an 8MB system?) 


Kernel Variables Of Interest 

- _max?siz from above; all integers 
- segment and page tables - see pte.h 
- _lotsfree, desfree, minfree - integers used by pageout daemon 
- _freemem - integer used by pager to keep track of free memory 
- _swdevt - array defined in conf.c 


Summary 


A process’ memory map is represented by something called a 
"vas" (virtual address space), which is little more than a 
doubly-linked list of pregions. A typical process will have 
4 "normal" pregions as well as some extra ones.... 


                        +------------------+ 4GB 
pregion 9 - u area      | overhead         | 
                        +------------------+ 
pregion 8 - stack       | user stack       | 
                        +------------------+ <-- top of stack 
                        |        \/        | 
pregions 3-7       ---> |                  |   this diagram is a 68K view; 
  shared libs      ---> |                  |   though the 700 is different, 
  (mapped files)   ---> |                  |   the same vas/pregion 
                   ---> |                  |   structure is used. 
                        |        /\        | 
                        +------------------+ <-- top of data segment 
                        | user bss/data    | 
pregion 2 - data        |                  | 
                        +------------------+ 
pregion 1 - text        | text             | 
                        +------------------+ 0 


- Virtual-to-physical mapping is handled by the MMU, with the 
aid of per-process segment and page tables. The first part 
of the address indexes into the segment table, the next 
indexes into the page table that the STE pointed at, and the 
last piece is a 12-bit index into the page. 


- "Regions" are groups of pages that are all of the same type 
(e.g. text, stack, etc), and the system is set up to allow easy 
sharing of them. If a process is using a particular region, it 
will have a "pregion" that points to the region and tells 
where in the process’ address space that region is mapped. 


- "Paging" refers to kicking out individual pages (loosely based 
on frequency of use) and then faulting them back in if needed; 
"Swapping" refers to kicking out and bringing in whole processes. 
Paging is a much gentler way to stretch the amount of memory. 


- Swap space is reserved whenever a process starts (via fork/exec) 
or grows (via malloc (==> sbrk/brk)); it is actually allocated 
to a particular page in a region when that page is about to 
get swapped/paged. It is mapped via DBDs in the region; these 
index into the swaptab[]/swapmap[] structure for the system. 
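

The address split described in the first point above can be written out for the two-level case. The 10/10/12 division is inferred from the 1024-entry segment and page tables and the 4K page size given earlier; it does not cover the 3-level (block table) layout, and the names are invented for the example. 

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_BITS 12                         /* 4K pages */
#define INDEX_BITS  10                         /* 1024-entry tables */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1)
#define OFFSET_MASK ((1u << OFFSET_BITS) - 1)

struct va_parts {
    uint32_t seg_index;    /* index into the segment table */
    uint32_t page_index;   /* index into the page table the STE points at */
    uint32_t offset;       /* byte offset within the page */
};

/* Split a 32-bit virtual address the way the MMU would walk the tables. */
struct va_parts split_vaddr(uint32_t va)
{
    struct va_parts p;

    p.seg_index  = (va >> (OFFSET_BITS + INDEX_BITS)) & INDEX_MASK;
    p.page_index = (va >> OFFSET_BITS) & INDEX_MASK;
    p.offset     = va & OFFSET_MASK;

    /* The three pieces must reassemble into the original address. */
    assert(((p.seg_index << 22) | (p.page_index << 12) | p.offset) == va);
    return p;
}
```

For example, address 0x00403ABC lands in segment table entry 1, page table entry 3, at offset 0xABC within the page. 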


From 9.0 /etc/conf/h/pregion.h: 


/* 
 * Each process has a number of pregions which describe the 
 * regions which are attached to the process. 
 */ 
struct p_lle { 
        struct pregion *lle_next;       /* First pregion in list */ 
        struct pregion *lle_prev;       /* Last pregion in list */ 
}; 

typedef struct pregion { 
        struct p_lle p_ll;              /* Linked list of pregions in vas */ 
#define p_next p_ll.lle_next 
#define p_prev p_ll.lle_prev 
        short   p_flags; 
        short   p_type; 
        reg_t   *p_reg;                 /* Pointer to the region. */ 
        space_t p_space;                /* virtual space for region */ 
        caddr_t p_vaddr;                /* virtual offset for region */ 
        size_t  p_off;                  /* offset in region */ 
        size_t  p_count;                /* number of pages mapped by pregion */ 
        short   p_prot;                 /* protection ID of region */ 
        ushort  p_ageremain;            /* remaining number of pages to age */ 
        size_t  p_agescan;              /* index of next scan for vhand’s age hand */ 
        size_t  p_stealscan;            /* index of next scan for vhand’s steal hand */ 
        struct vas *p_vas;              /* Pointer to vas we’re under */ 
        struct pregion *p_forw;         /* Active chain of pregions */ 
        struct pregion *p_back; 
        struct pregion *p_prpnext;      /* list of pregions off region */ 
        struct pregion *p_prpprev;      /* list of pregions off region */ 
        size_t  p_lastfault;            /* last page faulted by this pregion */ 
        size_t  p_lastpagein;           /* last page-in scheduled for this pregion */ 
        short   p_trend_diff;           /* difference between last two page faults */ 
        ushort  p_trend_strength;       /* number of times p_trend_diff was the same */ 
        struct hdlpregion p_hdl;        /* HDL specific info for pregion */ 
} preg_t; 

/* 
 * Pregion flags. 
 */ 

#define PF_ALLOC        0x0001  /* Pregion allocated */ 
#define PF_MLOCK        0x0002  /* region is memory locked */ 
#define PF_EXACT        0x0004  /* map pregion exactly */ 
#define PF_ACTIVE       0x0008  /* Pregion on active chain */ 
#define PF_NOPAGE       0x0010  /* Pregion locked against paging */ 
                                /* either another pregion is */ 
                                /* responsible for paging this */ 
                                /* region or we don’t want it */ 
                                /* paged (UAREA and NULLDREF) */ 
#define PF_NOMAP        0x0020  /* Translations should not be */ 
                                /* resolved through this preg */ 
                                /* by HDL code (for privileged */ 
                                /* shared libraries). */ 
#define PF_PUBLIC       0x0040  /* May be public (for shared */ 
                                /* libraries) */ 
#define PF_DAEMON       0x0080  /* pregion is for kernel daemon */ 
#define PF_WRITABLE     0x0100  /* May grant write access to */ 
                                /* pages. */ 
#define PF_INHERIT      0x0200  /* Inherit across exec() */ 
#define PF_VTEXT        0x0400  /* vnode was marked as VTEXT */ 
#define PF_MMFATTACH    0x0800  /* MMF pregion is being attached */ 

#define PREGMLOCKED(PRP)        (PRP->p_flags & PF_MLOCK) 

/* 
 * Pregion types. 
 */ 
#define PT_UNUSED       0       /* Unused pregion. */ 
#define PT_UAREA        1       /* U area */ 
#define PT_TEXT         2       /* Text region. */ 
#define PT_DATA         3       /* Data region. */ 
#define PT_STACK        4       /* Stack region. */ 
#define PT_SHMEM        5       /* Shared memory region. */ 
#define PT_NULLDREF     6       /* Null pointer dereference page */ 
#define PT_LIBTXT       7       /* shared library text region */ 
#define PT_LIBDAT       8       /* shared library data region */ 
#define PT_SIGSTACK     9       /* signal stack */ 
#define PT_IO           10      /* I/O region */ 
#define PT_MMAP         11      /* Memory mapped file */ 
#define PT_GRAFLOCKPG   12      /* Framebuffer lock page */ 
#define PT_NTYPES       13      /* Total # pregion types defined */ 


From 9.0 /etc/conf/h/vas.h 


#define VA_CACHE_SIZE 1 
struct vas { 
        /* Doubly linked list of pregions */ 
        struct p_lle va_ll; 
#define va_next va_ll.lle_next 
#define va_prev va_ll.lle_prev 

        preg_t *va_cache[VA_CACHE_SIZE]; 

        int va_refcnt;          /* Number of pointers to this vas */ 
        vm_sema_t va_lock;      /* Lock structure */ 
        u_int va_rss;           /* Cached approx. of shared res. set size */ 
        u_int va_prss;          /* Cached approx. of private RSS (in mem) */ 
        u_int va_swprss;        /* Cached approx. of private RSS (on swap) */ 
        u_long va_flags;        /* various flags */ 
        struct file *va_fp;     /* file table entry for MMFs psuedo-vas */ 
        u_long va_wcount;       /* count of writable MMFs sharing psuedo-vas */ 
        struct proc *va_proc;   /* pointer to process, if there is one */ 
        struct hdlvas va_hdl;   /* HW Dependent info for vas */ 
}; 

typedef struct vas vas_t;       /* this needs to be visible to compile proc.h */ 

/* 
 * Values for va_flags 
 */ 
#define VA_HOLES        0x00000001      /* vas may have holes within pregions */ 
#define VA_IOMAP        0x00000002      /* there may be an iomap pregion in th... */ 
#define VA_NOTEXT       0x00000004      /* No text region in vas (EXEC_MAGIC a... */ 


From 9.0 /etc/conf/h/region.h: 


/* 
 * Per region descriptor.  One is allocated for 
 * every active region in the system.  Beware if you add 
 * data elements here:  Dupreg may need to copy them. 
 */ 
typedef struct region { 
        ushort  r_flags; 
        ushort  r_type;         /* type of region */ 
        size_t  r_pgsz;         /* size in pages */ 
        size_t  r_nvalid;       /* number of valid pages in region */ 
        size_t  r_swnvalid;     /* resident set size of swapped region */ 
                                /* (r_nvalid value when region swapped) */ 
        size_t  r_swalloc;      /* for RF_SWLAZY, # pgs actually allocated */ 
        ushort  r_refcnt;       /* number of users pointing at region */ 
        size_t  r_off;          /* offset into vnode (page aligned) */ 
        ushort  r_incore;       /* number of users pointing at region */ 
        short   r_mlockcnt;     /* number of processes that locked this */ 
                                /* region in memory. */ 
        int     r_dbd;          /* dbd for vfd’s when swapped */ 
        struct vnode *r_fstore; /* pointer to vnode where blocks come from */ 
        struct vnode *r_bstore; /* pointer to vnode where blocks go */ 
        struct region *r_forw;  /* links for list of all regions */ 
        struct region *r_back; 
        short   r_zomb;         /* set by xinval to indicate text bad */ 
        struct region           /* hash for region */ 
                *r_hchain; 
        union { 
                struct old_aout { 
                        u_int r_ubyte;    /* byte off in fstore (for old a.out) */ 
                        u_int r_ubytelen; /* byte len in fstore (for old a.out) */ 
                } r_byt; 
                struct mmf { 
                        struct ucred *r_ummfcred; /* credentials for MMF */ 
                        u_long r_filler1;         /* unused */ 
                } r_mmf; 
        } r_un; 
        vm_sema_t r_lock;       /* region lock */ 
        vm_sema_t r_mlock;      /* wait for region to be locked in memory */ 
        int     r_poip;         /* number of page I/Os in progress 
                                 * 
                                 * NOTE: must hold the region lock and the 
                                 * sleep_lock to increment the r_poip 
                                 * field (start an I/O). Must hold 
                                 * the sleep_lock to decrement. 
                                 */ 
        struct broot            /* Root of btree of vfd/dbd’s */ 
                *r_root; 
        unsigned long r_key;    /* Each region contains chunk and one key */ 
        chunk_t *r_chunk; 
        struct region *r_next;  /* links for regions sharing pages */ 
        struct region *r_prev; 
        struct pregion *r_pregs;/* list of pregions pointing to this region */ 
        struct hdlregion        /* HDL fields in region */ 
                r_hdl; 
} reg_t; 

#define r_byte          r_un.r_byt.r_ubyte 
#define r_bytelen       r_un.r_byt.r_ubytelen 
#define r_mmfcred       r_un.r_mmf.r_ummfcred 


/* 
 * Region flags 
 */ 
#define RF_NOFREE       0x0001  /* Don’t free region on last detach */ 
#define RF_ALLOC        0x0004  /* region is not on free list */ 
#define RF_MLOCKING     0x0008  /* set when locking region in memory */ 
                                /* wake up processes waiting on r_mlock */ 
                                /* when resetting this flag. */ 
#define RF_ZOMB         0x0010  /* set in xinval when a text turns bad */ 
#define RF_UNALIGNED    0x0020  /* Region is an unaligned view of vnode */ 
                                /* (support old a.out) */ 
#define RF_SWLAZY       0x0040  /* Don’t allocate all swap space up front */ 
#define RF_WANTLOCK     0x0080  /* someone else wants to lock this reg, */ 
                                /* so wakeup(rp) them. CHANGE FOR MP */ 
#define RF_HASHED       0x0100  /* region is hashed (fstore, byte) */ 
#define RF_EVERSWP      0x0200  /* set if region has ever been swapped */ 
#define RF_NOWSWP       0x0400  /* set if region is now swapped */ 
#define RF_DAEMON       0x0800  /* set if region is for a kernel daemon */ 
#define RF_UNMAP        0x1000  /* MMF region is being unmapped */ 
#define RF_IOMAP        0x2000  /* region is an iomap(7) region */ 

/* 
 * Logical index from region offset to vnode offset in bytes. 
 */ 
#define vnodindx(RP, PGINDX) (ptob(PGINDX + (RP)->r_off)) 

/* 
 * Region types 
 */ 
#define RT_UNUSED       ...     /* Region not being used. */ 
#define RT_PRIVATE      ...     /* Private (non-shared) region. */ 
#define RT_SHARED       ...     /* Shared region */ 


From 9.0 /etc/conf/h/conf.h: 


/* 
 * Swap device information 
 */ 
typedef struct swdevt 
{ 
        dev_t   sw_dev;         /* swap device */ 
        int     sw_enable;      /* enabled */ 
        int     sw_start;       /* offset for 300/700 */ 
        int     sw_nblks;       /* number of blocks */ 
        int     sw_nfpgs;       /* # of free pages */ 
        int     sw_priority;    /* priority of device */ 
        int     sw_head;        /* first swaptab[] entry */ 
        int     sw_tail;        /* last swaptab[] entry */ 
        struct swdevt *sw_next; /* next swap device */ 
} swdev_t; 

From 9.0 /etc/conf/h/swap.h: 


int fs_swap_debug; 

/* The following structure contains the data describing a 
 * swap file. 
 */ 

typedef struct swapmap { 
        ushort  sm_ucnt;        /* number of users on this page */ 
        short   sm_next;        /* index of free swapmap[] */ 
} swpm_t; 

typedef struct swaptab { 
        short   st_free;        /* index of 1st free swapmap[] */ 
        short   st_next;        /* index of next chunk for */ 
                                /* same dev or fs */ 
        int     st_flags;       /* flags defined below. */ 
        struct swdevt *st_dev;  /* swap device. */ 
        struct fswdevt *st_fsp; /* swap file system. */ 
        struct vnode *st_vnode; /* dev or fs vnode */ 
                                /* system chunk */ 
        int     st_nfpgs;       /* nbr of free pages on device */ 
        struct swapmap *st_swpmp; /* ptr to swapmap[] array. */ 
        int     st_site;        /* site number (DUX) */ 
        union { 
                int st_start;   /* starting addr on S300 */ 
                int st_swptab;  /* server swaptab[] index */ 
        } st_union; 
} swpt_t; 

typedef struct fswdevt { 
        struct fswdevt *fsw_next; /* next fs w/ same pri */ 
        int     fsw_enable;     /* enabled */ 
        int     fsw_nfpgs;      /* # free pages */ 
        int     fsw_allocated;  /* # of blocks allocated */ 
        uint    fsw_min;        /* min # preallocated */ 
        uint    fsw_limit;      /* max # to allocate */ 
        uint    fsw_reserve;    /* # to reserve */ 
        int     fsw_priority;   /* priority */ 
        struct vnode *fsw_vnode; /* file system vnode */ 
        short   fsw_head;       /* 1st swaptab[] entry */ 
        short   fsw_tail;       /* last swaptab[] entry */ 
        char    fsw_mntpoint[256]; /* file system mount pt. */ 
} fswdev_t; 

typedef struct devpri { 
        struct swdevt *first;   /* first fs for a priority */ 
        struct swdevt *curr;    /* allocate from this fs first */ 
} devpri_t; 

typedef struct fspri { 
        struct fswdevt *first;  /* first fs for a priority */ 
        struct fswdevt *curr;   /* allocate from this fs first */ 
} fspri_t; 

/* 
 * This is an overlay structure for a regular dbd. 
 * It MUST be the same size as a dbd. 
 */ 
typedef struct swpdbd { 
        uint    dbd_type:4, 
                dbd_swptb:14, 
                dbd_swpmp:14; 
} swpdbd_t; 

extern int nswapfs; 
extern int nswapdev; 
extern int swchunk; 
extern int maxswapchunks; 
extern int swapmem_cnt; 
extern int swapspc_cnt; 
extern int maxfs_pri; 
extern int maxdev_pri; 
extern struct vnode *swapdev_vp; 
extern struct swaptab *swapMAXSWAPTAB; 
extern vm_sema_t swap_lock;     /* Lock for all swap entries */ 
extern vm_lock_t rswap_lock;    /* Lock for reserveing swap */ 
extern int swapwant;            /* Set non-zero if someone is */ 
                                /* waiting for swap space. */ 

#define SWTYPE_DEV      0x1     /* raw disk swap dev */ 
#define SWTYPE_FS       0x2     /* file system swap device */ 
#define SWTYPE_LAN      0x4     /* diskless (lan) swap device */ 
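

To make the swpdbd overlay concrete, here is a sketch that packs and unpacks the same three fields with explicit shifts. The bit positions are an assumption (type in the top 4 bits, then the swaptab[] index, then the swapmap[] index); the real placement depends on how the compiler lays out bit-fields, and the dbd_*() helpers are invented for the example. 

```c
#include <assert.h>
#include <stdint.h>

#define DBD_INDEX_BITS 14
#define DBD_INDEX_MASK ((1u << DBD_INDEX_BITS) - 1)

/* Build a 32-bit swap dbd: 4-bit type, 14-bit swaptab index, 14-bit swapmap index. */
uint32_t dbd_pack(unsigned type, unsigned swptb, unsigned swpmp)
{
    assert(type < 16 && swptb <= DBD_INDEX_MASK && swpmp <= DBD_INDEX_MASK);
    return ((uint32_t)type << 28) | ((uint32_t)swptb << 14) | (uint32_t)swpmp;
}

unsigned dbd_type(uint32_t dbd)  { return (unsigned)(dbd >> 28); }
unsigned dbd_swptb(uint32_t dbd) { return (unsigned)((dbd >> 14) & DBD_INDEX_MASK); }
unsigned dbd_swpmp(uint32_t dbd) { return (unsigned)(dbd & DBD_INDEX_MASK); }
```

Two 14-bit indices mean a DBD can name at most 16384 swaptab[] entries and 16384 pages within each chunk. 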




From 9.0 /etc/conf/h/vmmeter.h: 


/* 


* Virtual memory related instrumentation 


mf 


struct vmmeter 


#define v_first v_swtch 


unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
unsigned 


v_swtch; 
v_trap; 
v_syscall; 
v_intr; 
v_pdma ; 
v_pswpin; 
v_pswpout ; 
v_pgin; 
v_pgout ; 
v_pgpgin; 
vV_pgpgout ; 
v_intrans; 
v_pgrec; 
v_xsfrec; 
v_xifrec; 
v_exfod; 
v_zfod; 
v_vrfod; 
v_nexfod; 
v_nzfod; 
v_nvrfod; 
v_pgfrec; 
v_faults; 
v_scan; 
v_rev; 
v_seqfree; 
v_dfree; 
v_cwfault ; 
£_bread; 
£_breadcache ; 
f_breadsize; 
f_breada; 
£_breadacache; 
£_breadasize; 
f_bwrite; 

f£ _bwritesize; 
£ _bdwrite; 

£ _bdwritesize; 


#ifdef —_ hp9000s800 


unsigned 
unsigned 


v_pgtlb; 
vV_swpwrt; 


#endif /* —_hp9000s800 */ 


#define 


}; 


#ifdef 
extern 
#Hendif 


unsigned 
unsigned 
unsigned 
unsigned 
unsigned 
v_last 

unsigned 
unsigned 
unsigned 
unsigned 


_KERNEL 


v_fastpgrec; 
£ clnbkf1; 

f flsempty; 
£ bufbusy; 

f delwrite; 
f delwrite 
v_free; 
v_swpin; 
v_swpout; 
v_runq; 


Gr 


struct vmmeter cnt, 


/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 
/* 


/* 
/* 


/* 
/* 
/* 
/* 
/* 


/* 
/* 
/* 
/* 


context switches */ 


calls 


calls to syscall() */ 


to trap */ 


device interrupts */ 


pseudo-dma interrupts */ 


pages 
pages 


swapped in */ 
swapped out */ 


pageins */ 
pageouts */ 


pages 
pages 


intransit blocking page faults */ 
page reclaims */ 


total 
found 
found 
pages 
pages 
fills 


paged in */ 
paged out */ 


in free list rather than on swapdev 
in free list rather than in filsys 
filled on demand from executables * 


zero filled on demand */ 

of pages mapped by vread() */ 
number of exfod’s created */ 
number of zfod’s created */ 

number of vrfod’s created */ 

page reclaims from free list */ 
total faults taken */ 

scans in page out daemon */ 
revolutions of the hand */ 


pages taken from sequential programs */ 


pages freed by daemon */ 
Copy on write faults */ 
bread requests */ 
bread cache hits */ 


total 
total 
total 
total 
total 
total 
total 
total 
total 
total 


bread bytes */ 
read aheads */ 


read ahead cache hits */ 
read ahead bytes */ 
bwrite requests */ 
bwrite bytes */ 

bdwrite requests */ 
bdwrite bytes */ 


tlb flushes */ 
swap writes */ 


fast reclaims in locore */ 


clean block found immediatly on free list 
free list empty */ 
buffer busy */ 


delayed write buffer written */ 


free memory pages */ 
swapins */ 
swapouts */ 


current leng 


a EA HLS 
oO A od st 


és 


a 


th of run queue */ 


24 
From 9.0 /etc/conf/h/vmsystm.h:


/* systemwide totals computed every five seconds */
struct vmtotal {
    unsigned int t_rq;      /* length of the run queue */
    unsigned int t_dw;      /* jobs in ``disk wait'' (neg priority) */
    unsigned int t_pw;      /* jobs in page wait */
    unsigned int t_sl;      /* jobs sleeping in core */
    unsigned int t_sw;      /* swapped out runnable/short block jobs */
    int t_vm;               /* total virtual memory */
    int t_avm;              /* active virtual memory */
    unsigned int t_rm;      /* total real memory in use */
    unsigned int t_arm;     /* active real memory */
    int t_vmtxt;            /* virtual memory used by text */
    int t_avmtxt;           /* active virtual memory used by text */
    unsigned int t_rmtxt;   /* real memory used by text */
    unsigned int t_armtxt;  /* active real memory used by text */
    unsigned int t_free;    /* free memory pages */
};

#ifdef _KERNEL
extern struct vmtotal total;
#endif


/*
 * Miscellaneous virtual memory subsystem variables and structures.
 */
#ifdef _KERNEL
extern int freemem;         /* remaining blocks of free memory */
extern int freemem_cnt;     /* number of processes waiting on freemem */
extern int avefree;         /* moving average of remaining free blocks */
extern int avefree30;       /* 30 sec (avefree is 5 sec) moving average */
extern int deficit;         /* estimate of needs of new swapped in procs */
extern int nscan;           /* number of scans in last second */
extern int multprog;        /* current multiprogramming degree */
extern int desscan;         /* desired pages scanned per second */

/* writable copies of tunables */
extern int maxslp;          /* max sleep time before very swappable */
extern int lotsfree;        /* max free before clock freezes */
extern int minfree;         /* minimum free pages before swapping begins */
extern int desfree;         /* no of pages to try to keep free via daemon */
extern int saferss;         /* no pages not to steal; decays with slptime */

/* AGEFRACTION of n means we want to age 1/n of a region before going on */
/* AGEFRACTION of 16 is the smallest possible since p_ageremain is a short */
#define LOGAGEFRACTION 4
#define AGEFRACTION (1 << LOGAGEFRACTION)
#define AGEFRACTIONMASK (AGEFRACTION - 1)

#endif
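

The three AGEFRACTION macros are easier to read with numbers plugged in. The following is a minimal sketch only: pages_per_pass() and pass_in_cycle() are hypothetical helpers invented for illustration, and nothing here claims to show how the pageout code actually uses the macros.

```c
#define LOGAGEFRACTION  4
#define AGEFRACTION     (1 << LOGAGEFRACTION)   /* 16 */
#define AGEFRACTIONMASK (AGEFRACTION - 1)       /* 0x0f */

/* Hypothetical helper: 1/AGEFRACTION of a region, computed as a shift. */
static unsigned int pages_per_pass(unsigned int region_pages)
{
    return region_pages >> LOGAGEFRACTION;
}

/* Hypothetical helper: position of a pass counter within one cycle of
 * AGEFRACTION passes, computed with the mask instead of a modulo. */
static unsigned int pass_in_cycle(unsigned int pass)
{
    return pass & AGEFRACTIONMASK;
}
```

Defining the fraction through its logarithm is what makes the shift and the mask possible; an arbitrary divisor would need a real division.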
Diskless


The Big Picture 


- How does HP-UX do without a disk? 


The Little Picture(s) 
- What a cnode can and can’t do 
- Context 
- Crash Detection 
- The server’s view 


- References 


Diskless


What a Cnode Can And Can’t Do 


- It can...
    ...run programs & deal with I/O, context switching, etc.
    ...handle its own swapping if a local swap disk is present
    ...be a fully functional networking node/gateway


- It can’t...
    ...access its own filesystem - there’s no disk!
        - 8.0 allows "locally mounted filesystems"; these are
          really "locally attached filesystem disks", since
          they are part of the cluster’s filesystem
    ...allocate its own PIDs independently
    ...swap (assuming no local disk)
        - local swap has always been allowed
        - in 8.0 one cnode can act as the "swap server"
          for other cnodes iff
            a) it has a local swap disk and
            b) its cnode id is shown as their
               swap site in /etc/clusterconf
    ...automagically keep its clock in synch
    ...access devices on the server or other nodes (what
       is a device file? what would remote device support
       imply?)


Diskless


Context 
- Set at boot time.


- Provides a general mechanism for matching files with machines 
and/or capabilities. 


- If a machine has a floating point accelerator in it,
that implies that it needs to "see" a different math 
library than a normal machine would need. 


- In theory, this sort of thing could be used to allow 
for having both UCB and AT&T command sets available, or 
providing for a S300 and S800 to get their respective 
executables off of the same disk. This is in fact what 
is done in 7.0/8.0 when we have an S800 serving S300 
clients; /bin and many other things become CDFs. 


- The key place it is used in the kernel is in pathname lookup. 
When the search for "/etc/reboot" finds its way to the actual 
disk, the system will notice if the file is a CDF. If it is, 
it will drop down into the directory and start looking for 
files that match a context string. 


- What are the implications of having "system" files be CDFs? 
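

The lookup step described above (drop into the CDF directory and look for a file that matches the context) can be sketched as follows. This is an illustration under stated assumptions, not the kernel's code: cdf_pick() is a hypothetical name, and treating the context as a list searched in priority order with the first match winning is an assumption made for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Return the CDF entry chosen for this machine, or NULL if no context
 * element matches.  context[] is assumed to be in priority order, so
 * the first element that names an entry wins. */
static const char *cdf_pick(const char *const context[], size_t nctx,
                            const char *const entries[], size_t nent)
{
    size_t i, j;

    for (i = 0; i < nctx; i++)
        for (j = 0; j < nent; j++)
            if (strcmp(context[i], entries[j]) == 0)
                return entries[j];
    return NULL;
}
```

With a context of { "cnode1", "localroot", "default" } and a CDF holding only "default" and "localroot", the sketch picks "localroot". A CDF with no matching element yields NULL, which is how a file can appear not to exist on one cnode while being visible on another.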


SE 3002: Surviving as a Workstation SE 


Diskless 


Fun With CDFs 


they’re tricky! 


be sure to use "-hidden" with find(1) if you care about CDFs 
"ll -H" is your friend :-) 


if something isn’t a CDF when you first install the system, 
it probably shouldn’t be, e.g. making /etc a CDF so that 
the passwd, group, ... files can be customized on each 
client may seem clever at first, but will seem decidedly 
un-clever next time you want to boot :-(


be conscious of different "priorities" of context elements, 
i.e. having a CDF element for the server (by its name) and 
one for "localroot" too is a bad plan 


cnode-specific device files are often confused with CDFs, but 
are something different - basically a cnode-specific device 
file is one that can only be used on a particular cnode. By 
default, a device file can only be used on the machine it is 
created on; specifying additional options to mknod(1m) can
yield a device file that is 1) specific to another cnode, or
2) global (usable by the whole cluster).


Diskless


Crash Detection 


- When in the course of human events a diskless node goes out 
to lunch, it takes cluster resources with it. It is important 
that this be detected quickly, since other nodes may be waiting 
on files or memory or whatever. 


- Whenever a node receives a packet from another node, it keeps 
track of this. If it notices that it hasn’t received a packet 
from a node very recently, it will send a message to that node 
asking it to respond. If it does, fine; if not, it is declared 
dead and its resources are reclaimed. 


- The kernel parameters check_alive_period and retry_alive_period 
deal with this. If for some reason it is OK/expected that nodes 
will be unable to respond quickly, they may need to be raised, 
but in general they should be left alone. 
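

The detection scheme above amounts to a small state machine per remote node. The sketch below is hypothetical throughout: the struct, the function names, and the exact roles given to the two periods are assumptions made for illustration, not the HP-UX implementation.

```c
/* Hypothetical per-node bookkeeping; none of these names are from HP-UX. */
enum node_state { NODE_OK, NODE_PROBING, NODE_DEAD };

struct node {
    long last_heard;        /* time a packet last arrived from the node */
    long probe_sent;        /* time the "are you alive?" message went out */
    enum node_state state;
};

/* Called whenever any packet arrives from the node. */
static void node_heard(struct node *n, long now)
{
    n->last_heard = now;
    n->state = NODE_OK;
}

/* Called periodically.  check_period and retry_period stand in for
 * check_alive_period and retry_alive_period; their meaning here is an
 * assumption. */
static enum node_state node_check(struct node *n, long now,
                                  long check_period, long retry_period)
{
    if (n->state == NODE_OK && now - n->last_heard >= check_period) {
        n->probe_sent = now;            /* ask the node to respond */
        n->state = NODE_PROBING;
    } else if (n->state == NODE_PROBING &&
               now - n->probe_sent >= retry_period) {
        n->state = NODE_DEAD;           /* caller reclaims its resources */
    }
    return n->state;
}
```

Raising either period in this model delays the NODE_DEAD verdict, which is the trade-off the text describes: slow nodes are tolerated, but resources held by a node that really has crashed stay locked for longer.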
Diskless


The Server’s View 


The server is an ordinary system except that it has a few extra
processes running.


When a server cluster(1m)’s, it starts up a "Limited CSP". This
CSP is only willing to do certain things; if it is asked to do
something that might take a while, it will put the request on a
queue and let a "General CSP" handle it.


CSPs run at "important" priorities, i.e. better than normal 
user processes, but not real-time. 


When a request comes in from a cnode, it is put on a queue. When
a CSP becomes available, it will grab the request and start 
working on it. 


If a request takes too long, the CSP will commit suicide 
when it finishes - it will already have been replaced. 


The server is responsible for keeping the clocks synchronized 
(otherwise make wouldn’t work right), allocating chunks of PIDs 
to cnodes (lots of things use PIDs to generate filenames), and 
doing the swap and filesystem serving. 


The server must find out quickly if a node fails, so that 
resources can be reclaimed. 


If the server needs to reboot, it must shut down all the clients 


System Startup


The Big Picture 


How do we get from a doing-nothing system to a system 
running HP-UX? 


The Little Pictures 


The boot ROM and secondary loader. 
Configuring the virtual-memory subsystem. 
Preparing for I/O. 

Kicking off the first processes. 


What is the correspondence between things being accomplished 
and things being printed on the console’s screen? 


SE 390: Series 300 HP-UX Internals 
System Startup


Boot Rom and Secondary Loader (S300/400)


- The first 8K block of the disk is a boot block, which
contains a LIF directory and the secondary loader. A
copy of this block can be found in /etc/boot.


     first 8K (not to scale!)
+----------+--------+---------------------+----------+
| SYSHPUX  | secon- |                     |          |
| SYSBCKUP | dary   |     filesystem      |   swap   |
| SYSDEBUG | loader |                     |          |
+----------+--------+---------------------+----------+


- The boot rom reads the LIF directory for each disk present 
and allows the user to choose one of the entries (assuming 
attended boot). 


- Once an entry has been chosen, the bootrom loads the secondary 
loader and starts it running. The secondary loader "knows" 
where the bootrom keeps some of its variables, and it goes and 
looks to see which of the possible filenames was picked. 


(The bootrom uses the top page of physical RAM to store
variables. The kernel also has the top page mapped, and
the name of the kernel we booted is accessible via the kernel
variable "sysname"; the disk is designated by "msus".)


- The bootrom provides some very simple I/O routines, and the 
secondary loader uses these to print out the message, "booting 
/hp-ux" (assuming the default case) and to read in the kernel. 


The secondary loader has a bare-bones knowledge of the file 
system, and is smart enough to go look in /etc/clusterconf and 
pick an appropriate kernel out of /hp-ux+ based on that. 


- Once the kernel has been read in, the loader jumps to it, passing 
it the processor type, the address at which it was loaded, etc. 


- The kernel is now on its own....


System Startup


System Boot (S700) 


- The first 8K of the disk is a boot block, which 
contains a LIF directory. Doing a "lifls -l" of some
bootable disk will show that there are quite a few 
entries in the directory: filesystem, swap, HP-UX, 
some stuff for debuggers, etc. Most of this stuff is 
in the "boot area", which is at the end of a 700’s disk. 


A typical system disk might be laid out something like this: 


Note that this looks much like a 300/400 disk. The major 
difference is that the "secondary loader" for a PA machine is 
too big to fit into the 1st 8K block of the disk like the 300
would do, so it has been moved to a 2MB area at the end. 


- The bootrom will search for possible boot devices and consoles 
if it hasn’t been told in advance where to boot from. To 
interact with it, press and hold ESC shortly after powering on 
the machine; this will cause it to enter a menu-driven mode 
in which lots of things can be set/changed (things like boot 
paths, console/keyboard paths, the LANIC address, etc.).


NOTE: typing "secure on" at this point will keep you from
ever being able to change bootpaths, console paths, etc.!


- Once a device has been chosen to boot from, find something
else to do; it will be quite a while before anything happens 
on the console. Once the kernel is loaded and initialized, 
though, the 700 will make up for its initial sluggishness. 
It will ID cards (and really look quite a bit like a 300) 
as it boots and observers will be hard-pressed to keep up with 
what is being displayed. 


- Once the kernel is running, the system will go through all of 
the normal user-space things like /etc/rc, /etc/netlinkrc, etc.


System Startup


Starting Up The Virtual Memory System 
- Set up the kernel page table ("Sysmap") and turn on the MMU. 


- Initialize kernel memory mapping. The kernel *must* 
know about all physical memory: some is allocated to the 
kernel itself, some is allocated to user processes, and 
*all* of it must be kept track of. 


- See what swap devices are available. The table is
specified in conf.c, and is called swdevt[]. At
boot time it is scanned, and the disks are checked
to make sure the space is really there, etc. This is
when the system prints

    Swap device table: start and size given in 512-byte blocks...
    entry 0: autoconfigured on root device; start=X, size=Y


- Enable the first swap device in swdevt[].


- Fork process 2 to be the pageout daemon.


- Start looking for jobs to swap in/out. 


System Startup


Preparing For I/O


Call device driver link routines. Note the *_link
routines in /etc/conf/conf.c after you have run config(1m).
At bootup time, the system will walk that whole list, calling
each routine in it. The routine will add an entry for its
driver to a list that will be used when we actually find cards.


See what cards are installed. When a card is found, walk the 
list mentioned above. When a driver claims the card as its 
own, it will allocate data structures and do any other startup 
initialization (e.g. adding an entry to rupttable on the 68K). 


Look for a console. See the Facilities (Concepts & Tutorials) 
manual for the order in which things will be chosen. 


Mount the root filesystem. This is done by asking each disk 
driver whether it knows about the disk the bootrom says we 
booted from (this information is put in the top page of RAM by 
the bootrom along with the name ("SYSHPUX", "SYSBCKUP", etc)). 
When we find a driver that claims the disk, we can call its 
"open" routine and mount the disk. 
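

The two-phase scheme above (link routines build a list, then each card found is offered to the drivers on it) can be sketched as below. Every name is hypothetical, including the card IDs and the claim() callback signature; the real kernel structures are not shown in this class material.

```c
#include <stddef.h>

/* Hypothetical record that a *_link routine adds to the driver list. */
struct drv {
    const char *name;
    int (*claim)(int card_id);   /* nonzero if the driver owns the card */
    struct drv *next;
};

static struct drv *drv_list;     /* built while the link routines run */

/* What a link routine might do: push its driver onto the list. */
static void drv_register(struct drv *d)
{
    d->next = drv_list;
    drv_list = d;
}

/* Walk the list for a card found during the scan; return the driver
 * that claims it, or NULL if no configured driver recognizes it. */
static struct drv *drv_find(int card_id)
{
    struct drv *d;

    for (d = drv_list; d != NULL; d = d->next)
        if (d->claim(card_id))
            return d;
    return NULL;
}

/* Two made-up drivers with made-up card IDs, for demonstration only. */
static int demo_a_claim(int card_id) { return card_id == 21; }
static int demo_b_claim(int card_id) { return card_id == 5; }
static struct drv demo_a = { "demo_a", demo_a_claim, NULL };
static struct drv demo_b = { "demo_b", demo_b_claim, NULL };
```

A card that no driver claims simply gets NULL back, which models why a card whose driver was left out of the kernel goes unused rather than causing a failure at boot.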
System Startup


Starting The First Processes 


Build process 0 by hand; it will become the swapper. 


Start roundrobin scheduling. This isn’t really a process, but 
sort of acts like one. What we actually do is arrange for a
routine to be called every <timeslice> cpu ticks. 


Fork process 2 to become the pageout daemon. 
Start CSP if this is a diskless node. 


Fork process 1 to become init. We actually do some stuff to set 
this up as a user process so that when /etc/init is exec(2)ed, 
it is a normal user process. It is somewhat special, however,
because the kernel sort of looks out for it in a few areas (such
as not letting someone send SIGKILL to it, panic()ing if it
exit(2)s, etc).


In 8.0, the kernel runs /etc/pre_init_rc before starting
/etc/init so that the root filesystem can be checked without
any interference from user processes. Note that pre_init_rc
checks /dev/rroot, which is a character-special file that
represents the root disk (major & minor are both -1). If
/dev/rroot gets destroyed or isn’t there for some reason,

    # mknod /dev/rroot c -1 -1

will fix it.




System Startup 


Internal Actions vs. External Signs (on a 68K system; 700 is similar)


- "booting /hp-ux" 


set up kernel page table 

get info. from bootrom: processor type, amount of RAM,... 
allocate RAM for buffer cache, cmap, inodes, etc. 

clear out memory and decide if we have enough to continue 
call device driver link routines 

look for ttys, init. console 


- "Console is ITE" 
"ITE + 0 ports" 
"680x0 processor" 
"MC68881 coprocessor" 


look for I/O cards 


- "xxxxx at select code yy" - for each card found 
"real mem = xxxxxxx"
"mem reserved for dos = xxxxxxx" 
"using xxx buffers containing yyyyyy bytes of memory" 


twiddle data structures to reflect proc. 0 
start clock 

initialize root device 

initialize diskless stuff 


- "Local link is xxxxxxxx"     \
"Server link is yyyyyyyy"       > diskless systems only...
"Swap site is nn"              /
"Root device major is xx, minor is yyyy [root site is xx]"


initialize buffer cache 


- "Swap device table: (start and size...)"   \ these are present
"..... (line for each entry) ...."           / only if local swap
"Savecore image of xx pages will be saved at block yy in swap area"


configure swap devices 

mount root filesystem 

start up CPU roundrobin scheduling 
start up paging subsystem 

start up limited CSP 


8.0: check root filesystem via /etc/pre_init_rc


- "avail mem = xxxxxxx"
"lockable mem = xxxxxxx" 


fork init 
become the swapper 


<any further (normal) messages will be from init or its children>


Good CS Reference Books 


_The Design of the UNIX Operating System_ - Maurice Bach

_Advanced Programming in the UNIX Environment_ - W. Richard Stevens

_Modern Operating Systems_ - Andrew Tanenbaum

_Operating Systems: Design and Implementation_ - Andrew Tanenbaum

_Operating Systems Design: The XINU Approach_ - Douglas Comer

_The Design and Implementation of the 4.3BSD UNIX Operating System_ -
Leffler, McKusick, Karels, and Quarterman

_Algorithms + Data Structures = Programs_ - Wirth

_Algorithms_ - Sedgewick

_Computer Networks_ - Tanenbaum

_UNIX Network Programming_ - W. Richard Stevens

_Fundamentals of Interactive Computer Graphics_ - Foley & Van Dam

_Internetworking With TCP/IP_ - Comer

_Practical UNIX Security_ - Garfinkel & Spafford

_Software Tools In Pascal_ - Kernighan & Plauger

_The Elements of Programming Style_ - Kernighan & Plauger

_The UNIX Programming Environment_ - Kernighan & Pike

_UNIX System Administration Handbook_ - Nemeth, Snyder, & Seebass


A good place to get the above if you can’t find them locally... 


Computer Literacy Bookshop 
408-730-9955 
520 Lawrence Expressway, Sunnyvale, CA 94086 


2 blocks south of US 101, next to TOGO’s 

Open 7 days/week; mail orders, phone orders welcome 
America’s largest computer bookstore 

10,000 professional and PC titles 


Kernel Debugging Hints 


1. Dealing with "hung" processes.


When a process needs something that it can’t have (inside the kernel),
it will call a kernel routine named sleep(). One of the arguments it
is called with is a priority; if this is less than PZERO (see param.h),
this means that the sleep is *not* interruptible. If this is the case,
the sleep() had better be pretty short; if it turns out not to be, we
will wind up with a non-killable hung process. This is not A Good Thing.


How to deal with it? There are several ways. The first is to run 
monitor and see what its "single process info" screen will tell you 
about the process. The second is to use "ps -l" to get the sleep
channel and priority. If the priority is < PZERO, chances are this is
a driver bug. If we want to keep on investigating, we can feed this
address to adb(1) to find out what’s being waited for:

    adb /hp-ux /dev/kmem

This will usually work, but there’s a catch. Suppose the sleep channel
is 0x12345678. By default, adb(1) is only willing to look at addresses 
less than 0x1000000 (16 MB). If the sleep address is above this, it will 
be necessary to change adb(1)’s mapping, like this: 

    /m 0 0x1fffffff 0


This tells adb(1) to use a big piece of the address space, instead of 
just a tiny one.


Once the mapping is straightened out, use a command like this: 
    0x<sleep channel>/i


If adb(1) can find a symbol near that address, it will print out 
something like this: 


    _Bufferaddr+0x94:


This tells us that we may be waiting on a buffer. Sometimes this is
helpful, sometimes not; it is worth remembering.


2. Figuring out what went wrong in a system call or library routine.


This shouldn’t be in here, but in the interest of fending off questions, 
it is :-)


Let’s suppose someone writes a new version of cat(1), like this: 


    #include <stdio.h>
    #include <fcntl.h>


    main(argc, argv)
    int argc;
    char *argv[];
    {
        int fd, n;
        char buf[8192];


        fd = open(argv[1], O_RDONLY);


        while ((n = read(fd, buf, 8192)) > 0)
            write(1, buf, n);


        close(fd);
    }


Suppose that this is invoked on some file, and nothing comes out. Is it
necessarily because there isn’t anything in the file? What if...the mode 
of the file didn’t allow access? 
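

One way to answer that question is to have the program say why a call failed instead of silently producing nothing. The following is a minimal sketch, not part of the original class material; cat_file() and write_all() are names invented here. It checks the return value of each system call and lets perror(3) translate errno into a message.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all len bytes, retrying after short writes; -1 on error. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0)
            return -1;
        p += w;
        len -= (size_t)w;
    }
    return 0;
}

/* Copy the named file to standard output.  Returns 0 on success, or
 * -1 after reporting which step failed and why. */
static int cat_file(const char *path)
{
    char buf[8192];
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0) {
        perror(path);        /* the case the unchecked version hides */
        return -1;
    }
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        if (write_all(STDOUT_FILENO, buf, (size_t)n) < 0) {
            perror("write");
            close(fd);
            return -1;
        }
    }
    if (n < 0)
        perror("read");
    close(fd);
    return n < 0 ? -1 : 0;
}
```

In the unchecked version, a failed open() returns -1, read() on descriptor -1 fails at once, and the loop body never runs, so a permissions problem looks exactly like an empty file.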


3. Miscellaneous. 


If you are getting absolutely *bizarre* behavior from your system,
consider the possibility that you have a mismatch between different
parts (kernel vs. commands, part of kernel vs. another part, etc).
I once had an SE call in with a *strange* set of symptoms that I
simply couldn’t explain. It turned out that he had mixed 5.5 and 6.0
kernel library archives!


CDFs can cause pretty bizarre behavior if you aren’t watching out 
for them. 


If a device driver (or some other configurable part of the kernel) is
not configured in, the error one gets back isn’t necessarily clear....
For instance, if diskless is not configured into the server’s kernel
the cluster(1m) command will fail with "no such device or address".
How enlightening :-)


Driver Writing Information & Hints 


Introduction 


This document is taken from the prestudy for SE327, the now-defunct 
driver-writing class. If you are looking for a basic introduction to 

the concepts, this is worth reading. If you want more detailed information, 
order the HP-UX Driver Development Guide (98577-90013 as of August 1991). 


What is a Driver? 
Just what is a driver, anyway?


A. A "driver" is one of four distinct personality types, the other three 
being "amiable", "expressive", and "analytic". 


B. A "driver", along with the "iron", the "wedge", and the "putter", 
comprise the equipment needed for a game of golf. A driver is 
designed to deliver maximum force to the ball, and to sink fastest 
when thrown into water hazards in disgust. It also can be used to 
create larger divots when irons are insufficient for the task. 


C. A "driver" is the person sitting behind the steering apparatus of a 
locomotion vehicle. The only known exception to this rule is the 
"mother-in-law", which can be seated anywhere within the vehicle and 
still drive effectively. 


D. A "driver" is a piece of code which enables communication between the 
user and a particular piece of hardware. 


The correct answer, of course, is D. The driver bridges the gap between the 
user and the target hardware. 


User-Land Versus Kernel Drivers 


A driver can run as a user process (in "user-land") or as a kernel process. A 
driver executing as a user-land process runs at normal user priorities, and is 
subject to the same scheduling rules as any other process. The advantages of
a user-land driver are:

1. There is no kernel re-build or reboot necessary.

2. The driver writer can use adb/cdb for debugging.

3. The driver writer can use familiar user libraries in his/her code.

4. The driver writer has no need of kernel knowledge.


An example of a product which requires user-land drivers is the old VME
expander (98646A). Drivers for VME cards installed in that product had to run
in user-land.


Some of the disadvantages of user-land drivers are:

1. They’re slow!

2. Interrupts aren’t available.

3. DMA isn’t available.


The driver writer needs to evaluate his/her application and weigh the 
trade-offs between user-land and kernel drivers before deciding which is right 
for the task. Often, a simple user-land program will do the job in situations 
which don’t require great speed, interrupts, or DMA. Some tools available for 
writing user-land drivers are: 


1. Pseudo-terminals (ptys) - for RS-232/serial devices; 
2. Device I/O Library (DIL) - for HP-IB or GPIO devices; 


3. Iomap - useful with just about any interface card for which the driver 
writer has a register map. Maps a particular chunk of physical memory 
into user space. 


Since the purpose of the SE327 driver writing class is to fully describe 
kernel drivers, only kernel drivers will be discussed from this point on. 


Types of Drivers 


There are two types of kernel drivers: interface drivers and device drivers. 


The interface driver communicates with a particular type of interface card and 
doesn’t concern itself with the devices connected to that card, if any. For 
example, there are interface drivers for the MUX card and the HP-IB card. 


A device driver communicates with a particular class of device and doesn’t care 
about the interface it’s connected to. For example, a device driver would talk 
to a CS/80 disc, a ciper printer, or a serial device. 


These two types of kernel drivers can be combined into one driver if only one 
class of device can be connected to a particular interface card. Some of the 
more complex interfaces, like HP-IB, have three interface drivers (for the 
98624, the 98625, and the internal HP-IB interfaces) and a multitude of device 
drivers (for HP-IB printers, ciper printers, CS/80 discs, other discs, 9-track 
mag tape, etc.). 


Types of Driver Access 


There are two types of kernel driver access: block access and character (raw) 
access. 


When block access is used, data transferred between a user process and a
device is buffered. Data transfer occurs in units called blocks. 


When character access is used, there is no particular buffering scheme used, 
although the driver writer can use a buffering scheme if he/she so desires. 
Data is transferred in units of one or more bytes.


The type of access used depends heavily on the device to which the driver
talks. Devices having the following characteristics are good candidates for 
block access: 

1. The device supports random access of blocks. 

2. The data in each block is stable. 

3. The data is not available until it is requested. 


Typical block devices are discs and tapes. 


Devices having the following characteristics are good candidates for character 
access: 


1. The data cannot be accessed randomly. 

2. The data is not stable. 

3. The data can be available before any process requests it.


Typical character devices are terminals and printers.


Note that most devices can be accessed both ways. However, one type of access 
is usually optimal for a particular type of device. 


Driver Entry Points 
The HP-UX kernel expects all drivers to consist of one or more routines whose
names are consistent across all drivers. These routine names are called
"entry points". A driver may or may not have a particular entry point, but if 
it does, that entry point will always have the same name (how an entry point 
for one driver is distinguished from the same entry point for another driver is 
discussed later in this document). 


There is a different set of entry points for character drivers and block
drivers. The character driver entry points are: 


Entry Point     Function

open            Called from open(2)
close           Called from close(2)
read            Called from read(2)
write           Called from write(2)
ioctl           Called from ioctl(2)
select          Called from select(2)


For block drivers, the entry points are: 


Entry Point     Function

open            Called from open(2)
close           Called from close(2)
strategy        Called from read(2) or write(2)
size            Not user-callable; returns size of
                swap area on device, if any


These two sets of entry points simply mean that these are the routines the 
kernel knows how to call, given a particular type of driver access. Nothing 
stops the driver writer from writing a strategy routine for a character driver 
(in fact this is often done). The kernel won’t know how to call it, but the 
driver code itself can explicitly call it. 


In addition to these entry points, there are three more entry points for 
interface routines (used in DIO drivers only). The purpose of these routines 
will be discussed in class. These interface routines are: 


* link
* make_entry
* init


Finally, there are three "pseudo-driver" entry points. They are: 


Entry Point     Function

nulldev         Does nothing; kernel returns successfully to user.
nodev           Does nothing; kernel returns an error to user.
seltrue         Does nothing; kernel returns successfully to user.
                Used in place of a select routine when device is
                always ready for I/O.


These pseudo-driver entry points will be discussed in more detail later in 
this document, and in class. Note that "seltrue" has identical functionality 
to "nulldev". It exists at all simply because it is part of AT&T’s standard 
UNIX release. 


The Cdevsw and Bdevsw Tables 


How does the kernel keep track of the routines in each driver? 


There are two data structures, called the cdevsw table and the bdevsw table, 
which maintain pointers to the routines in each driver. The cdevsw table is 
used for character drivers, and the bdevsw table is used for block drivers. 


Each table is an array of structures. The array is indexed by the major 
number of the driver. Thus, at bdevsw[0] one would expect to find pointers to 
entry points in the block CS/80 driver (major number 0), and in cdevsw[4] one 
would expect to find pointers to entry points in the character CS/80 driver 
(major number 4). 


Each cdevsw table entry looks like this: 


    struct cdevsw {
        int (*d_open)();
        int (*d_close)();
        int (*d_read)();
        int (*d_write)();
        int (*d_ioctl)();
        int (*d_select)();
        int d_flags;
    };


Each cdevsw table entry contains pointers for the six character driver entry
points, and a parameter "d_flags" to contain flags. The available flags are:


C_ALLCLOSES specifies that the close entry point shall be called on all 
closes of the device, instead of only the last close. 


C_NODELAY specifies that the kernel shall not wait for I/O to 
complete, but shall return immediately to the user process. 


Each bdevsw table entry is similar: 


    struct bdevsw {
        int (*d_open)();
        int (*d_close)();
        int (*d_strategy)();
        int (*d_psize)();
        int d_flags;
    };


Each bdevsw table entry contains pointers for the four block driver entry
points, and the same flags parameter "d_flags".
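

Indexing a switch table by major number can be sketched as follows. This is an illustration under stated assumptions, not the kernel's code: struct demo_cdevsw is a trimmed-down stand-in for struct cdevsw, the 8-bit/24-bit major/minor split and the DEMO_ENODEV value are assumptions, and all demo_* names are hypothetical.

```c
typedef unsigned int demo_dev_t;

/* Assumed layout: major number in the top 8 bits, minor in the rest. */
#define DEMO_MAJOR(dev) (((dev) >> 24) & 0xffu)
#define DEMO_MINOR(dev) ((dev) & 0xffffffu)
#define DEMO_ENODEV 19                   /* illustrative error value */

/* Trimmed-down stand-in for struct cdevsw. */
struct demo_cdevsw {
    int (*d_open)(demo_dev_t dev);
    int (*d_close)(demo_dev_t dev);
};

static unsigned int cs80_last_minor;     /* records what open() saw */

static int demo_nodev(demo_dev_t dev)   { (void)dev; return DEMO_ENODEV; }
static int demo_nulldev(demo_dev_t dev) { (void)dev; return 0; }
static int demo_cs80_open(demo_dev_t dev)
{
    cs80_last_minor = DEMO_MINOR(dev);
    return 0;
}

static struct demo_cdevsw demo_table[] = {
    { demo_nodev, demo_nodev },          /* major 0: no character driver */
    { demo_nodev, demo_nodev },          /* major 1 */
    { demo_nodev, demo_nodev },          /* major 2 */
    { demo_nodev, demo_nodev },          /* major 3 */
    { demo_cs80_open, demo_nulldev },    /* major 4: character CS/80 */
};
#define DEMO_NCHRDEV (sizeof demo_table / sizeof demo_table[0])

/* What open(2) on a character device boils down to in this model. */
static int demo_open(demo_dev_t dev)
{
    unsigned int maj = DEMO_MAJOR(dev);

    if (maj >= DEMO_NCHRDEV)
        return DEMO_ENODEV;
    return (*demo_table[maj].d_open)(dev);
}
```

Unused slots point at a stub that returns an error rather than holding null pointers, so the dispatcher needs only a bounds check before the indirect call.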


Installing a Driver 


The procedure for installing a driver into a Series 300 HP-UX kernel is really 
quite simple. The overall procedure is given here, with more detail given in 
later sections. The procedure is: 


1. Compile driver. 


2. Modify /etc/master. 

3. Add driver name to dfile. 
4. Execute "config". 

5. Modify config.mk. 

6. Execute "make". 


This creates a new kernel which must be moved to /hp-ux. Once the system is 
rebooted, the new kernel is active. 


Compile the Driver 


Once the driver writer has written all his/her code, it must be compiled to 
create a ".o" file. 


Modify /etc/master 


This is probably the most time-consuming step. A line of information 
regarding the new driver must be added to /etc/master. The "config" routine 
uses this information in setting up the cdevsw and bdevsw tables and other 
data structures in conf.c. 


Each line in the first section of /etc/master gives information for one 
driver. Each line is of the form:


name prefix type mask bmajor cmajor 


"Name" is the driver name for use in config’s dfile. Use any descriptive name 
not already in use. 


"Prefix" can be the same as "name", or some other descriptive string. It is 
this string that the kernel uses to differentiate your kernel driver’s entry 
points from other drivers’ entry points. For example, if you specify a 
"prefix" of "mycode", the kernel expects to find entry points named 
"mycode_open", "mycode_close", etc. The driver writer presumably knows this
and codes his/her routine names accordingly. 
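

The naming rule (prefix, underscore, entry point) can be mimicked with the C preprocessor. The sketch below is only an illustration of the rule: the ENTRY macro and the stub routines are hypothetical, and config itself generates these names as text in conf.c rather than through a macro like this.

```c
/* Hypothetical driver whose /etc/master prefix is "mycode". */
static int mycode_open(void)  { return 0; }
static int mycode_close(void) { return 0; }

/* Build an entry-point name the way the text describes:
 * prefix, underscore, entry point. */
#define ENTRY(prefix, point) prefix##_##point

/* Expands to mycode_open; a link error here would mean the driver
 * does not define the routine its mask bits promise. */
static int (*open_ptr)(void) = ENTRY(mycode, open);
```

Because the name is assembled mechanically, a routine spelled "mycodeopen" or "my_code_open" is simply never found, and the mistake shows up as an undefined external when the kernel is linked.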


"Type" is a five-bit attribute flag. It has the following form: 


The meanings of the bits are: 


bit 0 - Set this bit if the driver should have an entry in the cdevsw 
table (which it should if it is a character driver). 


bit 1 - Set this bit if the driver should have an entry in the bdevsw 
table (which it should if it is a block driver). 


bit 2 - Set this bit if the driver is a required driver. "Config" will
include the driver in the new kernel whether its name appears in
dfile or not.


bit 3 - Set this bit if the driver name may only be specified once in 
dfile. If the driver’s name appears in dfile more than once, an 
error is generated. Normally this is not an error. 


bit 4 - Set this bit if this driver is an interface driver. This implies 
the presence of link, make_entry, and init routines. 


"Mask" is a 10-bit driver routine flag. It has the following form: 


The meanings of the bits are: 


bit 0 - Set this bit if the C_ALLCLOSES flag is desired. Otherwise, this 
flag is left unset. 


bit 1 - Set this bit if the "seltrue" pseudo entry point is desired 
instead of an actual "select" entry point. 


bit 2 - Set this bit if the driver has a select routine. 

bit 3 - Set this bit if the driver has an ioctl routine. 

bit 4 - Set this bit if the driver has a write routine. 

bit 5 - Set this bit if the driver has a read routine. 

bit 6 - Set this bit if the driver has a close routine. 

bit 7 - Set this bit if the driver has an open routine. 

bit 8 - Set this bit if the driver has a link routine. 

bit 9 - Set this bit if the driver has a size routine.


(Note that there is no bit specifying whether or not a block driver has a
strategy routine. It turns out that config expects to find a strategy
routine in all block drivers. An undefined external results if a block
driver having no strategy routine is installed.)


"Bmajor" is the block major number of the driver, if any. Specify -1 
otherwise. 


"Cmajor" is the character major number of the driver, if any. Specify -1 
otherwise. 


Determine values for all fields of the /etc/master line, and enter that line 
in /etc/master. Here are some sample entries: 


* name      prefix     type   mask   block   char
*
cs80        cs80       3      3FB    0       4
flex        mf         3      1FA    1       6
amigo       amigo      3      3FB    2       11
tape        tp         1      FA     -1      5
printer     lp         1      DA     -1      7
stape       stp        1      FA     -1      9
srm         srm629     1      1F2    -1      13
plot.old    pt         1      F2     -1      14
rje         rje        1      1FA    -1      15
ptymas      ptym       9      FC     -1      16
ptyslv      ptys       9      1FD    -1      17
ieee802     ieee802    1      1FD    -1      18
ethernet    ethernet   1      1FD    -1      19
hpib        hpib       1      FB     -1      21
gpio        hpib       1      1FB    -1      22
ciper       ciper      1      DA     -1      26
snalink     snalink    1      1C0    -1      36
dos         dos        1      F9     -1      27


For example, in the "cs80" line above, the CS/80 driver should have both a 
cdevsw and bdevsw table entry (according to "type"), and contains routines 
for all entry points except a true select routine (seltrue is used instead). 
The block major number is 0, and the character major number is 4. 


Add Driver Name to Dfile 


Edit an existing dfile, or create your own, and add the name of your driver to 
it (the name to enter is the same as "name" in the /etc/master entry you 
created). This causes "config" to include it in the new kernel. 


Execute Config 


Execute the "config" routine as follows: 
config dfile 


Config uses the information in dfile and /etc/master to create a conf.c file 
and a makefile called config.mk. The conf.c file contains all kernel 
configuration information modified per the instructions in /etc/master and 
dfile. For example, conf.c contains the new bdevsw and cdevsw tables, new 
kernel parameter settings, if any, etc. The config.mk makefile contains the 
instructions needed by "make" to compile and link a new kernel. 


For each driver name mentioned in dfile, config finds a line in /etc/master 
whose first field matches that name, and uses the information on that line to 
complete configuration for that driver. It builds the cdevsw and bdevsw 
tables by looking at "type" (to determine if entries should be built at all) 
and "mask" (to determine which entry points the driver contains). Config 
fills in the cdevsw/bdevsw tables with pointers to the actual routine names by 
adding the "prefix" and an underscore to the beginning of each entry point 
defined by that driver, and installing the resulting string into the table. 
It also adds an "external" declaration for the resulting routine name to 
conf.c. 


A portion of a cdevsw table from conf.c is shown below: 


struct cdevsw cdevsw[] = { 
/* 0*/ cons_open, cons_close, cons_read, cons_write, cons_ioctl, cons_select, 
       C_ALLCLOSES, 
/* 1*/ tty_open, tty_close, tty_read, tty_write, tty_ioctl, tty_select, 
       C_ALLCLOSES, 
/* 2*/ sy_open, sy_close, sy_read, sy_write, sy_ioctl, sy_select, 
       C_ALLCLOSES, 
/* 3*/ nulldev, nulldev, mm_read, mm_write, nodev, seltrue, 0, 
/* 4*/ cs80_open, cs80_close, cs80_read, cs80_write, cs80_ioctl, seltrue, 
       C_ALLCLOSES, 
/* 5*/ tp_open, tp_close, tp_read, tp_write, tp_ioctl, seltrue, 0, 
/* 6*/ nodev, nodev, nodev, nodev, nodev, nodev, 0, 
/* 7*/ lp_open, lp_close, nodev, lp_write, lp_ioctl, seltrue, 0, 


The commented numbers help identify which character major number each line is 
associated with. 


Note that missing entry points are automatically filled in by "config" with 
"nodev". (Whether or not an entry point is missing is specified by "mask".) 
This means that the kernel will do nothing and return an error if a user 
process calls a system call corresponding to the entry point in whose slot the 
"nodev" exists. For example, using the above table fragment, if a user issues 
a read(2) system call on a device file using the lp driver (major number 7), 
the kernel will do nothing and return an error. 


"Nodev" is appropriate anytime a driver does not have a particular entry point 
routine, and when calling that routine is considered an error. "Nulldev" can 
be used instead if calling a missing routine is not really erroneous. If you 
want to specify "nulldev" instead of "nodev" for particular entries in the 
cdevsw or bdevsw tables, you must edit conf.c by hand after "config" has 
finished executing. 


The C_ALLCLOSES flag can be specified via "mask". If it is not specified, a 
zero appears in that slot. If the C_NODELAY flag is desired, it must be 
manually added after "config" is finished executing. 


Modify Config.mk 


The name of the new driver’s object file must be added to the makefile created 
by "config". The object file name must be added to the HP-UX dependencies 
line and to the line containing the linker command string. The exact 
placement is shown below (the new driver’s object file is represented by 
"MYDRIVER.o"): 


hp-ux: conf.o MYDRIVER.o 
        rm -f hp-ux 
        ar x /etc/conf/libkreq.a locore.o vers.o name.o funcentry.o 
        @echo 'Loading hp-ux...' 
        $(LD) -m -n -o hp-ux -e _start -x \ 
                locore.o vers.o conf.o name.o funcentry.o MYDRIVER.o \ 
                $(LIBS1) $(LIBS) 
        rm -f locore.o vers.o name.o funcentry.o 
        chmod 755 hp-ux 


Execute Make 


Now execute "make" with: 

make -f config.mk 

This will compile the new conf.c file and link it with the various kernel 
libraries to produce a new HP-UX kernel. The new kernel is called "hp-ux", 
and is created in your current directory (usually /etc/conf). 


Install the new kernel with: 


mv /hp-ux /SYSBCKUP 
mv hp-ux /hp-ux 


and then reboot the system. 


NAME 
disked - interactive disk editor for HFS 


SYNOPSIS 
disked [-w] [-b <#> ] <special-file> 


DESCRIPTION 
Disked is an interactive disk editor that examines and 
modifies an HFS file system. It operates on either a 
character or block device associated with a file system. 
The file system should be unmounted while disked is being 
run on the file system. 


Disked reads commands from standard input and writes to 
either standard output or standard error. Although it was 
designed to be run interactively it can be used in batch 
mode by redirecting standard input. Most of the commands 
read data from disk into a buffer maintained by disked. 
Each command which reads from disk will overwrite this 
buffer. 


Disked normally opens special-file read-only. If the w 
option is specified then special-file is opened for reading 
and writing. Only by setting the w option is it possible 
for the user to damage the file system. 


If the b option is specified, disked will use the specified 
alternate superblock instead of the primary superblock to 
interpret the file system. 


Disked maintains two buffers called the browser and edit 
buffers. At any point in time only one of these two buffers 
is considered the current buffer. The x command can be used 
to switch the current buffer from the browser buffer to the 
edit buffer and vice-versa. The only significant difference 
between these two buffers is that it is possible to modify 
the disk when using the edit buffer. Disked initially sets 
the current buffer to the browser buffer. For more 
information see the section on Buffer Commands. 


The output of most of the commands can be redirected using 
the disked operators ">", ">>", and "|". The ">" symbol is 
used to redirect the output of an individual command to a 
file. The ">>" symbol provides the same functionality 
except that the output is appended. The "|" symbol is used 
to pipe the output of an individual command to any Unix 
command. For example, to redirect the output of the s 
command (display primary super-block) to a file called 
"foo", the following command would work: 

s > foo 

Hewlett-Packard Company                Apr 26, 1989 


The following is a detailed list of commands: 


General Commands: 


command        short description 

b <n> <c>      display <c> bytes starting from byte <n> 
f <n> <c>      display <c> bytes starting from fragment <n> 
r <n> <c>      display <c> bytes starting from sector <n> 


These commands are used to display data. The user is given 
the option of specifying a byte address (b command), a 
fragment number (f command), or a raw disk sector (r command 
- note: a raw disk sector is in terms of DEV_BSIZE units). 
The count argument <c> is optional for the f and r commands, 
and defaults to the fragment size or to DEV_BSIZE bytes, 
respectively. Each command displays data in the same format. 
The format is a byte address counter followed by a sequence 
of numbers and the character representation of those bytes. 
With the default settings, each number represents 4 bytes 
and is displayed in hex. The counter is initially displayed 
in decimal. The default values are changed by setting the 
variables wordsize, displayin and countin (see User settable 
variables below). All of these commands will allow the user 
to display from 1 to MAXBSIZE worth of data. 


command        short description 

i <n>          display inode 
p <path>       display inode 
d <n> <c>      display <c> bytes of directory entries 
               starting from fragment <n> 


These commands allow the user to traverse the directory 
tree. The i command can be used to display the contents of 
the specified inode. The root inode of an HFS file system 
is inode 2. The p command can be used to display the 
contents of the inode represented by <path>. If <path> is a 
relative pathname (does not begin with a ’/’), it will be 
interpreted as though the file system were mounted as the 
root file system and the current working directory were the 
root directory. An absolute pathname will be interpreted 
first as if the file system were mounted at the current or 
last mount point of a larger file hierarchy (using the 
fs_mnt field of the superblock); failing this, <path> will 
be interpreted as though the file system were mounted as the 
root file system. 


The d command is useful for displaying the data blocks of a 
directory inode as directory entries. Because data block 
addresses in the inode are really fragment numbers, this 
command (like the f command) takes an optional count 
argument <c>. If <c> is not specified, it defaults to the 
size of a fragment. 


command short description 
q exit the program 


Allows normal termination of the program. If the edit buffer 
has been modified, the q command will not allow the user to 
exit disked (see Q command). 


Buffer Commands: 


command        short description 

x              switch current buffer 
X              switch meaning of browser and edit buffers 


The x command is used to switch which buffer is the current 
buffer. When disked is first invoked the current buffer is 
the browser buffer. To edit the disk the user must change 
the current buffer to be the edit buffer. Then the user can 
read the data into the edit buffer and modify it. It is 
then possible to leave the changes in the edit buffer and 
switch buffers to the browser buffer. The user can then 
search through the disk without losing the changes. When 
the user wants to write the changes out, the user can switch 
back to the edit buffer, and use the W command to write the 
data to disk. 


The X command is similar to the x command except that X 
swaps the meaning of the browser and edit buffers such that 
the current browser buffer, along with its contents, becomes 
the edit buffer and vice versa. This makes it convenient to 
modify data already in the browser buffer without having to 
switch buffers and read in the same data to the edit buffer. 


Modification Commands (edit buffer only): 


command                          short description 

m <off>[:<rep>] <arglist>        modify buffer 
m <start>[-<stop>] <arglist>     modify buffer 
W                                write modified buffer 


The m command allows the user to modify the current buffer 
(which must be the edit buffer) at buffer offset <off> to be 
<arglist>. <arglist> is a list of numbers or characters 
separated by one or more blanks. If a rep is specified then 
the arglist will be repeated that many times. Off may be 
specified as either a number or as an offset into a known 
structure (for a list of known offsets type h offsets). 
Alternatively, the user may specify a range within the 
buffer to be modified. Each term in the arglist is put into 
a different word. Each word represents 1, 2 or 4 bytes 
depending on the value of wordsize. The only legal values 
for wordsize are 1, 2 or 4. The terms in the arglist will 
be padded so that each term completely fills one wordsize 
unit. 


The W command is used to write the modified buffer to disk. 
Note: Two ways exist to undo changes made to the current 
buffer. The first is to read data into the current buffer. 
This can be done with almost any of the commands. The 
second is to abort the program using the Q command. 


command        short description 

Q              abort program 


Abort the program even if the edit buffer has been modified. 
All changes are ignored and the program is terminated. 


Internal Data Structure Commands: 


command        short description 

s [s][r]       display primary super-block 
s <n> [r]      display redundant super-block <n> 


These commands are used to display either the primary 
super-block or the redundant super-block associated with 
each cylinder group. Included in each super-block is a 
rotational table. The r option is used to display this 
table. In addition, the first n blocks of data space 
contain summary information. The s option can be used to 
display this while displaying the primary super-block. 


command        short description 

c <n>          display cylinder group <n> 


This command is used to display the contents of any cylinder 
group. 


Use of expressions: 


Many disked commands expect one or more numbers as 
arguments. If a command expects a number then the number 
can always be replaced with an expression. An expression is 
either an integer or a parenthesized expression containing 
one or more of the following arithmetic operators: |, &, *, 
/, +, -. Further, an expression can contain any number of 
macros. Disked maintains a list of macros which can be 
invoked (type h macros). As an example, suppose the user 
wanted to display the cylinder group associated with a 
particular inode. One mechanism would be to use knowledge 
of how a disk is laid out and calculate the number by hand. 
The preferable method is to use the c command passing as an 
argument itog(<inode number>). 


Free List Manipulation 


command        short description 

w > <file>     write current buffer to <file> 
w >> <file>    append current buffer to <file> 


With these two commands it is possible to walk through the 
free lists and recover lost data. 


example: 


In the following manner it is possible to read the free data 
blocks of one unmounted file system and write the data 
blocks to a file on a mounted file system. The c command 
can be used to obtain a list of free fragments in each 
cylinder group. With this information the f command can be 
used to read the free fragment into the current buffer. The 
following formula will convert a cylinder group relative 
fragment number to a file system relative fragment number: 
<fragment-number> + cgbase(<cylinder-group-number>). 
Once the data has been read into the current buffer, it can 
be written to any file on a mounted file system with the w 
command. 


Extended commands 


command                    short description 

copyi <inode number>       display data for inode 
map                        display a map of this disk 
tell <fragment>            describe fragment 
bgrep "string" <b> <e>     search fragments <b> to <e> for string 


Copyi takes as input an inode number and displays the data 
blocks associated with it. It is very important that the 
user ensure that the specified inode is valid. The size and 
blocks fields in the inode must be correct or disked might 
not be able to display the data blocks. In addition, it is 
very important that checking not be turned off when this 
command is executed (see User settable variables). 


Map is used to display a fragment map of all fragments on 
the disk. 


Tell takes as input a fragment number and provides 
information about the specified fragment. 


Bgrep searches for the specified string from fragment <b> 
through fragment <e> and displays the fragment number of 
any fragment that contains this string. 
If <b> and <e> are not specified, then the search defaults 
to the whole disc. The string must be enclosed in double 
quotes and may contain C style escape characters and grep(1) 
style regular expressions. 


User settable variables: 


command                    short description 

set <variable> <value>     assign <value> to <variable> 


This command is used to set any one of a number of different 
global variables. What follows is a list of variables and 
their possible values and then a description of what each 
variable does: 


variable       possible values (default values are in bold) 

check          (on, off) 
countin        (octal, hex, decimal) 
displayin      (octal, hex, decimal) 
display        (on, off) 
init           (on, off) 
wordsize       (1, 2, or 4) 

check 


This variable controls whether or not certain error 
checks are performed by disked. Disked goes to great 
lengths to prevent the user from damaging the file 
system. Turning this variable off will prevent disked 
from performing these checks. This should obviously be 
done only with great care if disked is being used with 
the w option. 


countin 
This variable determines the radix in which the counter 
is displayed for the b, f, and r commands. 


displayin 
This variable determines the radix in which data is 
displayed for output (with the b, f, and r commands). 


display 
This variable controls whether or not the b, f or r 
commands will display the data when it is read in. It 
is useful to unset this variable when copying a known 
set of free blocks from the device to a file on another 
disk. 


init 
This variable controls whether or not the edit and 
browser buffers are re-initialized when a new disk is 
opened (see n command). By unsetting this variable it 
is possible to copy at most MAXBSIZE worth of data from 
one disk to another. 


wordsize 
This variable controls the primary wordsize (number of 
bytes in a word) for the program. On output, it 
affects the amount of data to be displayed at any point 
in time. On input, it will control the amount of data 
overwritten for each argument in the arglist of the m 
command. 


Miscellaneous commands: 


command                            short description 

h <topic>                          provide on-line help 
? <topic>                          provide on-line help 
h help                             list topics available for help 
B                                  display current buffer as data 
C                                  display current buffer as cylinder group 
D                                  display current buffer as directory 
F                                  display current buffer as data 
I                                  display current buffer as inodes 
R                                  display current buffer as data 
S                                  display current buffer as super-block 
! <command>                        execute monitor command 
n [-w] [-b <#> ] <special-file>    restart program using <special-file> 
                                   and specified options 

command             short description 

= <n>               display number 


This command takes as input an expression and displays the 
value of that expression in hex, octal and decimal. 


command             short description 

$<a-z> = expr       assign a value to a local variable 


This command assigns the expression to a local variable. 
There are 26 local variables $a - $z. Once a local variable 
has a value it can be used in any expression. To display 
the value of a local variable use the = command. 


In addition to the 26 local variables, disked supports two 
local variables called $size and $address. These variables 
are the size and address of the current buffer. They may be 
used in any expression where a local variable is used. This 
enables the user to reference the size and address of the 
current buffer, without typing in the actual numbers. 
Further, if the current buffer is the edit buffer then the 
user can change the values of $size and $address. This has 
the effect of changing where disked believes the data 
resides. By changing $address and then writing the edit 
buffer out, the user can move data from one place to another 
on the disk. 

example: 


The following example displays the contents of the n-th 
cylinder group, where n is (0x314 + 12) / 013. 


Monday Afternoon Labs 


0. If you have NOT used "monitor" much, run it and take a look at 
each of the screens of information. Use the online help facility. What 
things does monitor(1m) tell you that you can’t (yet) make use of? 


1. Using the template provided (ppt.c), print out the values of at 
least 5 kernel parameters. Verify 2-3 of them with monitor(1m). If you 
want ideas on what to print, look at space.h or monitor’s C screen. 


2. Look through the "pm" and "misc" directories in the examples 
archive I gave you. Are there useful functions (or whole programs)? 


3. Start work on your version of monitor, focusing on process stuff. 
Consider printing (among other things) 


- the process table (like ps does) 
- the proc table entry and u area for a given process 
- relevant kernel parameters 


Tuesday Afternoon Labs 


1. Change the major number of some driver in /etc/master and rebuild 
your kernel. Then make a corresponding device file and reboot. Change 
something that 1) you can verify and 2) won’t kill your machine if you 
mess up. A good candidate would be character-mode SCSI/CS80 (whichever 
one your disk is). 


2. Install the ramdisk driver on the system and add code to print out the 
size and 1k block address whenever a block is read or written (there is 
a printf() in the kernel just like there is in libc for user programs). 
You will probably need to replace the one that is already there 
(use "ar t" to figure out which library it is in). 


3. Reconfigure your kernel and look at the conf.c that gets generated. 
Which parts of it come from dfile? Which come from /etc/master? 


4. Force your system to panic and interpret the resulting stack trace. 
(misc/th_init is helpful here... :-)) 


5. Take a look at the supplied pseudo-driver called "pdisk". How does 
it compare to the pty drivers (the things that enable telnet/X11/script 
to work)? 


*** Be sure to look at the examples in the "fs" directory before doing 
these labs; also, note that many of them are easier on a ramdisk *** 


0. Write a program to hunt for superblocks on a disk. 


1. Translate a pathname to an i-number using adb(1), fsdb(1m), disked(1m), 
or a C program you write. 


2. Modify "myls.c" to be something along the lines of "myll.c"; in other 
words, get the inode for each file and print things like the size, owner 
UID, etc. 


3. Use the ramdisk driver (or pdisk driver/server) to learn about the 
filesystem’s layout and "habits". How is the filesystem affected by 
fs_async? 


4. Mess up the disk using disked(1m) or some other command. (You needn’t 
get too violent - how about dd(1)ing over the 1st 16K?) Then fix it 
using fsck(1m), disked(1m), or whatever you want (dd(1)ing from another 
disk is strictly an option of last resort :-)) 


***** OR ***** 


Write a version of cat(1) that uses only a disk device file. 


Diskless 


1. Cluster your system with another, and look closely at what monitor(1m) 
will tell you about both machines. 


2. Locally mount a ramdisk, and make it so that no one else in the 
class can access the stuff down under the mount point. This is not 
tricky/hard/etc :-) 


1. See what monitor, iostat(1), and vmstat(1) will tell you about the 
state of the VM system. How does their output change if you run a [...] 


2. Write a program that will summarize swap space usage by looking 
at swaptab[], swapspc_max, and swapspc_cnt. It should produce output 
something like this: 

        there is a total of XXX MB on the system 
        YYY MB is free 
        ZZZ MB is allocated 
        AAA MB is reserved but not yet allocated 

You might want to enhance it to summarize diskless client usage as 
well, i.e. 

        BBB MB has been allocated to <name of client 1> 
        CCC MB has been allocated to <name of client 2>... 

Note that you do not need to walk through each swaptab[]’s swapmap array. 


3. What had to change in "top" for it to work in 8.0? Change it 
so that it sorts by size instead of CPU usage (i.e. have it print 
the 10 (or whatever) *biggest* programs, rather than the 10 that are 
using the most CPU time). 


Friday Labs 


0. Shut down the system and reboot it, watching carefully to see what 
gets printed out. What is the last line printed by the kernel? What 
is the first line printed by init(1m)? 


1. Finish/clean up your labs, and see if there are things in monitor 
that you recognize now that didn’t make sense earlier. 


2. Give your instructor a copy of your monitor and your filesystem 
programs. Please put them in a {shell,cpio,tar} archive. Thanks! 


#include <stdio.h>
#include <sys/param.h>
#include <fcntl.h>
#include <nlist.h>
#include <sys/user.h>
#include <sys/proc.h>

/*
 * Example of reading /dev/kmem to get at kernel data
 * structures.  Note that this is NON-PORTABLE and
 * UNSUPPORTED - it may break with future releases of
 * HP-UX.  It's fun, though :-)
 *
 * first we declare a data structure that will be passed to nlist(3);
 * note that we are only filling in the first member of each structure
 * in the array, and that we end with a null member
 */

struct nlist nl[] = {           /* setup for calls to nlist(3) */
#ifdef hp9000s800
    { "nproc" },                /* # entries in process table */
    { "proc" },                 /* pointer to process table */
#else
    { "_nproc" },               /* # entries in process table */
    { "_proc" },                /* pointer to process table */
#endif
    { "" }
};

#define C_NPROC 0               /* indices into the above array */
#define C_PROC  1

int kmem;                       /* file descriptor for kernel mem */


main()
{
    startup();
    walk_table();
    exit(0);
}


startup()               /* read symbol table & open kernel memory */
{
    if (nlist("/hp-ux", nl) < 0) {
        perror("nlist(3)");     /* can't get symbol table */
        exit(1);
    }

    if ((kmem = open("/dev/kmem", O_RDONLY)) < 0) {
        perror("open(2)");      /* can't open kernel mem */
        exit(1);
    }
}


walk_table()            /* step through the process table */
{
    int i, nproc;
    long pt_addr;
    struct proc *proc_table, *p;

    /*
     * first go get the value of nproc from /dev/kmem, using
     * the address nlist(3) returned to us
     */
    lseek(kmem, nl[C_NPROC].n_value, 0);
    read(kmem, &nproc, sizeof nproc);
    proc_table = (struct proc *) calloc(nproc, sizeof(struct proc));

    /*
     * now get the *address* of the proc table, seek there,
     * and get the real thing; this is because proc is a
     * pointer rather than a simple variable
     */
    lseek(kmem, nl[C_PROC].n_value, 0);
    read(kmem, &pt_addr, sizeof pt_addr);
    lseek(kmem, pt_addr, 0);
    if ((i = read(kmem, proc_table, sizeof(struct proc) * nproc)) < 0) {
        perror("read proc_table");
        close(kmem);
        exit(1);
    }

    /*
     * we have the proc table; get in a loop and step through
     * the whole thing, printing a line for each slot that
     * is being used
     */
    p = proc_table;

    for (i = 0; i < nproc; i++) {
        if (p->p_stat)          /* if entry in use */
            printf("pid, pgrp, uid, suid are %d %d %d %d\n",
                p->p_pid, p->p_pgrp, p->p_uid, p->p_suid);
        p++;
    }

    close(kmem);
}


A Quick Introduction to adb(1) 


When in the course of human events it becomes necessary to patch 
a kernel or examine it, there are very few commands that will do 
the job. One possibility is adb(1), a general-purpose debugger 
that is capable of doing most anything. It is hard to use, but 
sometimes it’s the only thing available.... 


If you need to use adb(1), here are some annotated examples. Note that 
adb(1) really only knows about executable files and core files; since 
/hp-ux is an executable and /dev/kmem is kernel memory (which has basically 
the same format as a core file), we can use it to work on the kernel. 
The "# " in each example was printed by the shell; everything else left of 
the arrows below was typed in by the intrepid hacker :-) 


# adb /hp-ux 
dfile_data?s          <--- print variable "dfile_data" as a string 
                           from /hp-ux (note the "?") 
19232?10i             <--- disassemble; print 10 instructions 
                           starting at address 19232 

# adb -w /hp-ux /dev/kmem 
fs_async/D            <--- print variable "fs_async" as an integer 
                           from /dev/kmem (note the "/") 
/W 0                  <--- set it to 0 (turn it off) in /dev/kmem 


Note that using "/" will cause adb(1) to work with the "core" file (/dev/kmem) 
and that this will either take effect immediately (for a simple variable) 
or not work at all (for something like nproc which sizes a data structure). 


Using "?" will direct adb(1) to the "a.out" (/hp-ux), which won’t take 
effect until you reboot (which may be what you want, and which is your only 
choice if you are changing the size of a table in the kernel). 


One last thing: adb(1) is, uh, somewhat lacking in its user interface :-) 
It is *very* picky about syntax, case, etc; in the string "fs_async/D" 
above, it really does have to be a capital "D". To get out of the 
program, use either "$q" or the old standby, "<ctrl-d>". 


A Thousand Words Worth :-) 